Introduction

This file contains practice and exam questions designed to test your skills and understanding. It also illustrates the format of our mid-term exam (on June 4, 2018) and final exam (on July 16, 2018).

This exam contains a total of XXX tasks and a maximum score of YYY points.

Course coordinates

spds.uni.kn

Preparation and response format

0. Please answer the following questions by creating a single R script (or an R Markdown file, .Rmd) that contains all your code and answers and meets the following criteria:

  • Layout issues:

    1. Include a header that contains your name, student ID, this course, and today’s date.

    2. Load the R packages of the tidyverse.

    3. Structure your file clearly by labeling the current task (e.g., # Task 1: -----) and subtask (e.g., # (a) ...:) and by leaving blank lines between all tasks and subtasks.

    Here’s a layout template that you can copy and adapt:

## Final exam  | Data science for psychologists (Summer 2018)
## Name: ... | Student ID: ...
## 2018 07 16
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##

## Preparations: ----- 

library(tidyverse)


## Task 1: ----- 

# (a) Save data as tibble and inspect data:
pg <- as_tibble(PlantGrowth)
pg

## Answer: The PlantGrowth data contains 30 cases (rows) and 2 variables (columns). 

## (b): ... 
## ...


## Task X: ----- 
## (a) ... 

## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## End of file. ----- 
  • Save your script (regularly) as Lastname_Firstname_midTerm_180604.R (replacing Lastname and Firstname by your names).

  • When asked for numbers or interpretations, include short answers as comments in your script. However, when asked for quantitative summaries containing more than 2 numbers (e.g., descriptive statistics of a dataset), simply print your results (e.g., the output of a dplyr pipe) in your code.

  • Submit your script as an attachment to an email (with the subject header “ds4psy: final exam”) to h.neth@uni.kn by Wednesday (July 18th, 2018).

3: Data visualisation

Key skills

The chapter introduces visualizations (with ggplot), but not yet data transformations (with dplyr). Key skills to be acquired in this context include creating:

  • scatterplots (geom_point) and trendlines (geom_line and geom_smooth),
  • grouping (via aesthetic mappings like color, shape, size, etc.),
  • facetting (facet_wrap and facet_grid),
  • bar charts (geom_bar) for given values (stat = "identity"), frequency counts, or proportions, for different bar positions (stack, fill, and dodge),
  • boxplots (geom_boxplot), and
  • adjusting visual aspects (colors, shapes, themes), labels (e.g., plot titles and captions), and coordinates (coord_cartesian, coord_flip, and coord_polar).
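Many of these elements can be combined in a single plot. The following is a minimal illustrative sketch (using the built-in ggplot2::mpg data, which is not part of any task below):

```r
library(tidyverse)

# Scatterplot with grouping (color), a trendline, and facets:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class), alpha = 2/3) +  # grouping via an aesthetic mapping
  geom_smooth(se = FALSE) +                      # trendline
  facet_wrap(~drv) +                             # one facet per drive train
  labs(title = "Highway mileage by engine displacement",
       x = "Engine displacement (l)", y = "Highway miles per gallon",
       caption = "Data from ggplot2::mpg.") +
  theme_bw()
```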

Example tasks

Task 1: Growing plants

The PlantGrowth data (contained in R datasets) reports the results from an experiment that compares growth yields (measured by the dried weight of plants) obtained under 2 treatments vs. a control condition.

1a. Save the PlantGrowth data as a tibble and inspect its dimensions. (1 point)

1b. Use a dplyr pipe to compute the number of observations (rows) in each group and some key descriptives (their mean, median, and standard deviation). (2 points)

1c. Use ggplot to create a graph that shows the medians and raw values of plant weight by group. (2 points)

Hints: Use 2 different geoms to show both medians and raw values in the same plot. The order of layers is determined by the order of geom commands.

# ?datasets::PlantGrowth

# (a) Save as tibble and inspect data:
pg <- as_tibble(PlantGrowth)
pg # => 30 cases (rows) x 2 variables (columns)
#> # A tibble: 30 x 2
#>    weight group
#>     <dbl> <fct>
#>  1   4.17 ctrl 
#>  2   5.58 ctrl 
#>  3   5.18 ctrl 
#>  4   6.11 ctrl 
#>  5   4.50 ctrl 
#>  6   4.61 ctrl 
#>  7   5.17 ctrl 
#>  8   4.53 ctrl 
#>  9   5.33 ctrl 
#> 10   5.14 ctrl 
#> # ... with 20 more rows

# (b) Compute number of observations by group, their mean, median and standard deviation: 
pg %>%
  group_by(group) %>%
  summarise(count = n(),
            mn_weight = mean(weight),
            md_weight = median(weight),
            sd_weight = sd(weight)
            )
#> # A tibble: 3 x 5
#>   group count mn_weight md_weight sd_weight
#>   <fct> <int>     <dbl>     <dbl>     <dbl>
#> 1 ctrl     10      5.03      5.15     0.583
#> 2 trt1     10      4.66      4.55     0.794
#> 3 trt2     10      5.53      5.44     0.443

# (c) Plot the median and raw values of weight by group: 
ggplot(pg, aes(x = group, y = weight)) +
  # geom_violin() +
  geom_boxplot(aes(fill = group)) +
  geom_point(aes(shape = group), alpha = 2/3, size = 4, position = "jitter") + 
  ## Pimping plot: 
  labs(title = "Plant weight by group", x = "Group", y = "Weight", 
       caption = "Data from datasets::PlantGrowth.") + 
  scale_fill_manual(values = c("grey75", "gold", "steelblue3")) +
  theme_bw()

pt <- pt + 5  # increment point total

Task 2: Time to sleep

The sleep data (contained in R datasets) shows the effect of 2 sleep-promoting drugs (as an increase in hours of sleep compared to a control group) for 10 patients.

2a. Save the sleep data as a tibble sp and inspect its dimensions. (1 point)

2b. Use a dplyr pipe to compute the number of observations (rows) by group
and key descriptives (their mean, median, and standard deviation). (2 points)

2c. Use ggplot to create a graph that shows the medians and raw values of extra sleep time by group. (2 points)

Hints: Use 2 different geoms to show both medians and raw values in the same plot. The order of layers is determined by the order of geom commands.

2d. Reformat the sleep data in sp so that the 2 groups appear in 2 lines (rows) and 10 subject IDs as 10 columns. (1 point)

Hints: This implies using the tidyr::spread command to change a long dataset into a wider format.

# ?datasets::sleep

# (a) Save data as tibble and inspect:
sp <- as_tibble(sleep)
sp # => 20 cases (rows) x 3 variables (columns)
#> # A tibble: 20 x 3
#>     extra group ID   
#>     <dbl> <fct> <fct>
#>  1  0.700 1     1    
#>  2 -1.60  1     2    
#>  3 -0.200 1     3    
#>  4 -1.20  1     4    
#>  5 -0.100 1     5    
#>  6  3.40  1     6    
#>  7  3.70  1     7    
#>  8  0.800 1     8    
#>  9  0     1     9    
#> 10  2.00  1     10   
#> 11  1.90  2     1    
#> 12  0.800 2     2    
#> 13  1.10  2     3    
#> 14  0.100 2     4    
#> 15 -0.100 2     5    
#> 16  4.40  2     6    
#> 17  5.50  2     7    
#> 18  1.60  2     8    
#> 19  4.60  2     9    
#> 20  3.40  2     10

# (b) Compute the number of observations and descriptives of extra time by group:
sp %>% 
  group_by(group) %>%
  summarize(n = n(),
            md = median(extra),
            mn = mean(extra),
            sd = sd(extra))
#> # A tibble: 2 x 5
#>   group     n    md    mn    sd
#>   <fct> <int> <dbl> <dbl> <dbl>
#> 1 1        10 0.350 0.750  1.79
#> 2 2        10 1.75  2.33   2.00

# (c) Visualize the raw values and averages by group:
ggplot(sp, aes(x = group, y = extra)) +
  geom_boxplot(aes(fill = group)) +
  # geom_violin() +
  geom_point(aes(shape = group), size = 4, alpha = 2/3, position = "jitter") + 
  ## Pimping plot: 
  labs(title = "Extra sleep time by treatment group", 
       x = "Treatment group", y = "Extra sleep time", 
       caption = "Data from datasets::sleep.") + 
  scale_fill_manual(values = c("gold1", "steelblue3")) +
  theme_bw()

# (d) Reformat data so that the 2 groups appear in 2 lines, 
#     and 10 subject IDs as 10 columns:
sp %>%
  spread(key = ID, value = extra)
#> # A tibble: 2 x 11
#>   group   `1`    `2`    `3`    `4`    `5`   `6`   `7`   `8`   `9`  `10`
#> * <fct> <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1     0.700 -1.60  -0.200 -1.20  -0.100  3.40  3.70 0.800  0     2.00
#> 2 2     1.90   0.800  1.10   0.100 -0.100  4.40  5.50 1.60   4.60  3.40

pt <- pt + 6  # increment point total

Task 3: Dietary chicks

The ChickWeight data (contained in R datasets) contains the results of an experiment that measures the effects of Diet on the early growth of chicks.

3a. Save the ChickWeight data as a tibble and inspect its dimensions. (1 point)

3b. Create a line plot showing the weight development of each individual chick (on the y-axis) over Time (on the x-axis) for each Diet (in 4 different facets). (2 points)

# ?datasets::ChickWeight

# (a) Save data as tibble and inspect:
cw <- as_tibble(ChickWeight)
cw # => 578 observations (rows) x 4 variables (columns)
#> # A tibble: 578 x 4
#>    weight  Time Chick Diet 
#>  *  <dbl> <dbl> <ord> <fct>
#>  1   42.0  0    1     1    
#>  2   51.0  2.00 1     1    
#>  3   59.0  4.00 1     1    
#>  4   64.0  6.00 1     1    
#>  5   76.0  8.00 1     1    
#>  6   93.0 10.0  1     1    
#>  7  106   12.0  1     1    
#>  8  125   14.0  1     1    
#>  9  149   16.0  1     1    
#> 10  171   18.0  1     1    
#> # ... with 568 more rows

# (b) Scatter and/or line plot showing the weight development of each chick (on the y-axis) 
#     over Time (on the x-axis) for each Diet (as different facets): 
ggplot(cw, aes(x = Time, y = weight, group = Diet)) +
  facet_wrap(~Diet) + 
  geom_point(alpha = 1/2) +
  geom_line(aes(group = Chick)) +
  geom_smooth(aes(color = Diet)) + 
  labs(title = "Chick weight by time for different diets", x = "Time (number of days)", y = "Weight (in gm)", 
       caption = "Data from datasets::ChickWeight.") +
  theme_bw()

3c. The following bar chart shows the number of chicks per Diet over Time.
We see that the initial Diet groups contain different numbers of chicks and that some chicks drop out over Time:

# (c) Bar plot showing the number (count) of chicks per diet over time: 
ggplot(cw, aes(x = Time, fill = Diet)) +
  geom_bar(position = "dodge") +
  labs(title = "Number of chicks per diet over time", x = "Time (number of days)", y = "Number", 
       caption = "Data from datasets::ChickWeight.") +
  theme_bw()

Instead of re-creating this plot, create a table (or tibble) that shows the same (4 x 12 = 48) data points in 4 rows (for the 4 different types of Diet) and 12 columns (for the different Time points). (2 points)

Hints: In a first step, count the number of chicks per Diet and Time. Next, spread the results into a wider format (to show the time points as different columns).

# (c) Re-create the counts of chicks per diet over time numerically (using `dplyr` and `tidyr`)
#     as a table: How many chicks are there per diet over time?
cw %>%
  group_by(Diet, Time) %>%
  count() %>%
  spread(key = Time, value = n) 
#> # A tibble: 4 x 13
#> # Groups:   Diet [4]
#>     Diet   `0`   `2`   `4`   `6`   `8`  `10`  `12`  `14`  `16`  `18`  `20`
#> * <fctr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1      1    20    20    19    19    19    19    19    18    17    17    17
#> 2      2    10    10    10    10    10    10    10    10    10    10    10
#> 3      3    10    10    10    10    10    10    10    10    10    10    10
#> 4      4    10    10    10    10    10    10    10    10    10    10     9
#> # ... with 1 more variables: `21` <int>

## Not asked: 
# (x) Plot the weight of each individual chick 
# and the Median weight per diet at the end (Time = 21).
# **Hint:** Filter data for the maximum time and 
#       combine 2 geoms: A boxplot and a scatterplot.
cw %>%
  filter(Time == 21) %>%
  ggplot(., aes(x = Diet, y = weight, fill = Diet)) +
    geom_boxplot() +
    geom_point(size = 4, alpha = 1/2) +
    coord_flip() +
    theme_bw()


pt <- pt + 5  # increment point total

Task 4: Tabular TBC

The tidyr::table1 data shows the number of TB (tuberculosis) cases for 3 countries and 2 years.

4a. Plot a bar chart that shows the number of cases per country (on the y-axis) as a function of the year (on the x-axis). (2 points)

4b. Format the bars (showing cases per country) in different colors. (1 point)

4c. Provide a suitable plot title and a caption noting the data source. (1 point)

4d. Label each bar with the number of cases. (1 point)
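As no solution for this task is shown here, the following sketch covers 4a–4d (treating value labels above the bars as sufficient for 4d; all plot choices are suggestions):

```r
library(tidyverse)

# Bar chart of TB cases per country by year (4a), with colored bars (4b),
# title and caption (4c), and value labels on each bar (4d):
ggplot(table1, aes(x = factor(year), y = cases, fill = country)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = cases),
            position = position_dodge(width = 0.9), vjust = -0.3) +
  labs(title = "TB cases by country and year",
       x = "Year", y = "Number of TB cases",
       caption = "Data from tidyr::table1.") +
  theme_bw()
```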

Task 5: Party plots

The following table provides the percentage share of 2 major parties in the last 2 German general elections (based on this link):

Party     Share 2013   Share 2017
CDU/CSU        41.5%        33.0%
SPD            25.7%        20.5%
Others             ?            ?

5a. Create a tibble de that contains this data and the missing (?) values for all other parties so that all shares of an election add up to 100%. (2 points)

5b. Convert your de table into a “tidy” table saved as de_2. (2 points)

Hints: Use tidyr::gather to list the values of all election results in 1 variable called share and make sure that de_2 contains a separate variable (column) that specifies the election year.

5c. Visualize and contrast the election results by a bar chart that contains 2 bars (representing the 2 elections) and the party’s share of votes (as the proportions of each bar). (2 points)

Hints: As the data in de_2 already contains the identity of the values which you want to plot, there is no need to count anything. Showing multiple values in one bar is called a “stack”.

## (a) Create a tibble with the data:
de <- tibble(
    party = c("CDU/CSU", "SPD", "Others"),
    share_2013 = c((.341 + .074), .257, (1 - (.341 + .074) - .257)), 
    share_2017 = c((.268 + .062), .205, (1 - (.268 + .062) - .205))
  )
de$party <- factor(de$party, levels = c("CDU/CSU", "SPD", "Others"))  # optional
de
#> # A tibble: 3 x 3
#>   party   share_2013 share_2017
#>   <fct>        <dbl>      <dbl>
#> 1 CDU/CSU      0.415      0.330
#> 2 SPD          0.257      0.205
#> 3 Others       0.328      0.465

## Check that columns add to 100:
sum(de$share_2013)  # => 1 (qed)
#> [1] 1
sum(de$share_2017)  # => 1 (qed)
#> [1] 1

## (b) Converting de into a tidy data table:
de_2 <- de %>%
  gather(share_2013:share_2017, key = "election", value = "share") %>%
  separate(col = "election", into = c("dummy", "year")) %>%
  select(year, party, share)
de_2
#> # A tibble: 6 x 3
#>   year  party   share
#> * <chr> <fct>   <dbl>
#> 1 2013  CDU/CSU 0.415
#> 2 2013  SPD     0.257
#> 3 2013  Others  0.328
#> 4 2017  CDU/CSU 0.330
#> 5 2017  SPD     0.205
#> 6 2017  Others  0.465

## Note that year is of type character, which could be changed by:
# de_2$year <- parse_integer(de_2$year)

## (c) Bar chart showing proportions for each election:
ggplot(de_2, aes(x = year, y = share, fill = party)) +
  ## (A) 1 bar per election (position = "stack"):
  geom_bar(stat = "identity", position = "stack") +  # 1 bar per election
  ## (B) 3 bars per election (position = "dodge"):  
  # geom_bar(stat = "identity", position = "dodge") +  # 3 bars next to each other
  ## Pimping plot: 
  scale_fill_manual(values = c("black", "red3", "gold")) + # optional
  labs(title = "Partial results of the German general elections 2013 and 2017", 
       x = "Year of election", y = "Share of votes", 
       caption = "Data from www.bundeswahlleiter.de.") +
  theme_classic()

pt <- pt + 6  # increment point total

Tasks

  • Create a data frame or tibble
  • Plot bar chart (with stat = "identity")
  • Create pie chart (with coord_polar)
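On the last point: a pie chart is simply a stacked bar chart mapped onto polar coordinates. A minimal sketch (the data frame df and its shares are hypothetical, not election results):

```r
library(tidyverse)

# Hypothetical shares that sum to 1:
df <- tibble(party = c("A", "B", "C"),
             share = c(0.45, 0.35, 0.20))

ggplot(df, aes(x = "", y = share, fill = party)) +
  geom_bar(stat = "identity", width = 1) +  # one stacked bar
  coord_polar(theta = "y") +                # turn the bar into a pie
  theme_void()
```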

5: Data transformation

library(tidyverse)    # dplyr
library(nycflights13) # data

Key skills

This chapter illustrates the basic table-manipulation tools (verbs) of dplyr. Key skills conveyed include transforming data by using essential dplyr commands:

  • filtering and arranging cases (rows);
  • selecting and re-arranging variables (columns);
  • computing and adding new variables (with mutate and transmute);
  • computing counts and descriptives of group aggregates (by group_by, summarise, and functions);
  • computing new group-level variables (by grouped mutates).
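The last skill (a grouped mutate) is the only verb not illustrated by the solutions below. A sketch using dplyr::starwars (the variable names mn_height and height_rel are suggestions):

```r
library(tidyverse)

# Grouped mutate: compare each individual's height to its species mean:
sw_rel <- starwars %>%
  group_by(species) %>%
  mutate(mn_height  = mean(height, na.rm = TRUE),  # group-level mean
         height_rel = height - mn_height) %>%      # deviation from it
  ungroup() %>%
  select(name, species, height, mn_height, height_rel)
sw_rel
```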

Example tasks

Task 6: R wars

Let’s tackle the universe with the tidyverse by uncovering some facts about the dplyr::starwars dataset. Answer the following questions by using pipes of basic dplyr commands (i.e., arranging, filtering, selecting, grouping, counting, summarizing).

6a. Save the tibble dplyr::starwars as sw and report its dimensions. (1 point)

6b. Missing values and known unknowns:

  • How many missing (NA) values does sw contain? (1 point)

  • Which individuals come from an unknown (missing) homeworld but have a known birth_year or known mass? (1 point)

6c. Gender issues:

  • How many humans are contained in sw overall and by gender? (1 point)

  • How many and which individuals in sw are neither male nor female? (1 point)

  • Of which species in sw exist at least 2 different gender values? (1 point)

6d. Popular homes and heights:

  • From which homeworld do the most individuals (rows) come? (1 point)

  • What is the mean height of all individuals with orange eyes from the most popular homeworld? (1 point)

6e. Size and mass issues:

  • Compute the median, mean, and standard deviation of height for all droids. (1 point)

  • Compute the average height and mass by species and save the result as h_m. (1 point)

  • Sort h_m to list the 3 species with the smallest individuals (in terms of mean height). (1 point)

  • Sort h_m to list the 3 species with the heaviest individuals (in terms of median mass). (1 point)

# library(tidyverse)
# ?dplyr::starwars

## (a) Basic data properties: ---- 
sw <- dplyr::starwars
dim(sw)  # => 87 rows (denoting individuals) x 13 columns (variables) 
#> [1] 87 13

## (b) Missing data: ----- 

## (+) How many missing data points?
sum(is.na(sw))  # => 101 missing values.
#> [1] 101

# (+) Which individuals come from an unknown (missing) homeworld 
#     but have a known birth_year or mass? 
sw %>% 
  filter(is.na(homeworld), !is.na(mass) | !is.na(birth_year))
#> # A tibble: 3 x 13
#>   name      height  mass hair_color skin_color eye_color birth_year gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#> 1 Yoda          66  17.0 white      green      brown          896   male  
#> 2 IG-88        200 140   none       metal      red             15.0 none  
#> 3 Qui-Gon …    193  89.0 brown      fair       blue            92.0 male  
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>


## (x) Which variable (column) has the most missing values?
colSums(is.na(sw))  # => birth_year has 44 missing values
#>       name     height       mass hair_color skin_color  eye_color 
#>          0          6         28          5          0          0 
#> birth_year     gender  homeworld    species      films   vehicles 
#>         44          3         10          5          0          0 
#>  starships 
#>          0
colMeans(is.na(sw)) # => birth_year is missing in ~50.6% of all cases.
#>       name     height       mass hair_color skin_color  eye_color 
#> 0.00000000 0.06896552 0.32183908 0.05747126 0.00000000 0.00000000 
#> birth_year     gender  homeworld    species      films   vehicles 
#> 0.50574713 0.03448276 0.11494253 0.05747126 0.00000000 0.00000000 
#>  starships 
#> 0.00000000

## (x) Replace all missing values of `hair_color` (in the variable `sw$hair_color`) by "bald": 
# sw$hair_color[is.na(sw$hair_color)] <- "bald"


## (c) Gender issues: ----- 

# (+) How many humans are there of each gender?
sw %>% 
  filter(species == "Human") %>%
  group_by(gender) %>%
  count()
#> # A tibble: 2 x 2
#> # Groups: gender [2]
#>   gender     n
#>   <chr>  <int>
#> 1 female     9
#> 2 male      26

## Answer: 35 humans in total: 9 female, 26 male.

# (+) How many and which individuals are neither male nor female?
sw %>% 
  filter(gender != "male", gender != "female")
#> # A tibble: 3 x 13
#>   name     height  mass hair_color skin_color  eye_color birth_year gender
#>   <chr>     <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> 
#> 1 Jabba D…    175  1358 <NA>       green-tan,… orange         600   herma…
#> 2 IG-88       200   140 none       metal       red             15.0 none  
#> 3 BB8          NA    NA none       none        black           NA   none  
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

# (+) Of which species are there at least 2 different gender values?
sw %>%
  group_by(species, gender) %>%
  count() %>%  # table shows species by gender: 
  group_by(species) %>%  # Which species appear more than once in this table? 
  count() %>%
  filter(nn > 1)
#> # A tibble: 5 x 2
#> # Groups: species [5]
#>   species     nn
#>   <chr>    <int>
#> 1 Droid        2
#> 2 Human        2
#> 3 Kaminoan     2
#> 4 Twi'lek      2
#> 5 <NA>         2

# alternative (and shorter) solution:
sw %>%
  group_by(species)%>%
  summarise(n_gender_vals = n_distinct(gender)) %>%
  filter(n_gender_vals >= 2)
#> # A tibble: 5 x 2
#>   species  n_gender_vals
#>   <chr>            <int>
#> 1 Droid                2
#> 2 Human                2
#> 3 Kaminoan             2
#> 4 Twi'lek              2
#> 5 <NA>                 2

## (d) Homeworld issues: ----- 

# (+) Popular homes: From which homeworld do the most individuals (rows) come? 
sw %>%
  group_by(homeworld) %>%
  count() %>%
  arrange(desc(n))
#> # A tibble: 49 x 2
#> # Groups: homeworld [49]
#>    homeworld     n
#>    <chr>     <int>
#>  1 Naboo        11
#>  2 Tatooine     10
#>  3 <NA>         10
#>  4 Alderaan      3
#>  5 Coruscant     3
#>  6 Kamino        3
#>  7 Corellia      2
#>  8 Kashyyyk      2
#>  9 Mirial        2
#> 10 Ryloth        2
#> # ... with 39 more rows
# => Naboo (with 11 individuals)

# (+) What is the mean height of all individuals with orange eyes from the most popular homeworld? 
sw %>% 
  filter(homeworld == "Naboo", eye_color == "orange") %>%
  summarise(n = n(),
            mn_height = mean(height))
#> # A tibble: 1 x 2
#>       n mn_height
#>   <int>     <dbl>
#> 1     3       209

## Note: 
sw %>% 
  filter(eye_color == "orange") # => 8 individuals
#> # A tibble: 8 x 13
#>   name    height   mass hair_color skin_color  eye_color birth_year gender
#>   <chr>    <int>  <dbl> <chr>      <chr>       <chr>          <dbl> <chr> 
#> 1 Jabba …    175 1358   <NA>       green-tan,… orange         600   herma…
#> 2 Ackbar     180   83.0 none       brown mott… orange          41.0 male  
#> 3 Jar Ja…    196   66.0 none       orange      orange          52.0 male  
#> 4 Roos T…    224   82.0 none       grey        orange          NA   male  
#> 5 Rugor …    206   NA   none       green       orange          NA   male  
#> 6 Sebulba    112   40.0 none       grey, red   orange          NA   male  
#> 7 Ben Qu…    163   65.0 none       grey, gree… orange          NA   male  
#> 8 Saesee…    188   NA   none       pale        orange          NA   male  
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>


# (+) What is the mass and homeworld of the smallest droid?
sw %>% 
  filter(species == "Droid") %>%
  arrange(height)
#> # A tibble: 5 x 13
#>   name  height  mass hair_color skin_color  eye_color birth_year gender
#>   <chr>  <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> 
#> 1 R2-D2     96  32.0 <NA>       white, blue red             33.0 <NA>  
#> 2 R5-D4     97  32.0 <NA>       white, red  red             NA   <NA>  
#> 3 C-3PO    167  75.0 <NA>       gold        yellow         112   <NA>  
#> 4 IG-88    200 140   none       metal       red             15.0 none  
#> 5 BB8       NA  NA   none       none        black           NA   none  
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

## (e) Group summaries: ----- 

# (+) Compute the median, mean, and standard deviation of `height` for all droids.
sw %>%
  filter(species == "Droid") %>%
  summarise(n = n(),
            not_NA_h = sum(!is.na(height)),
            md_height = median(height, na.rm = TRUE),
            mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 1 x 5
#>       n not_NA_h md_height mn_height sd_height
#>   <int>    <int>     <dbl>     <dbl>     <dbl>
#> 1     5        4       132       140      52.0

# (+) Compute the average height and mass by species and save the result as `h_m`:
h_m <- sw %>%
  group_by(species) %>%
  summarise(n = n(),
            not_NA_h = sum(!is.na(height)),
            mn_height = mean(height, na.rm = TRUE),
            not_NA_m = sum(!is.na(mass)),
            md_mass = median(mass, na.rm = TRUE)
            )
h_m
#> # A tibble: 38 x 6
#>    species       n not_NA_h mn_height not_NA_m md_mass
#>    <chr>     <int>    <int>     <dbl>    <int>   <dbl>
#>  1 Aleena        1        1      79.0        1    15.0
#>  2 Besalisk      1        1     198          1   102  
#>  3 Cerean        1        1     198          1    82.0
#>  4 Chagrian      1        1     196          0    NA  
#>  5 Clawdite      1        1     168          1    55.0
#>  6 Droid         5        4     140          4    53.5
#>  7 Dug           1        1     112          1    40.0
#>  8 Ewok          1        1      88.0        1    20.0
#>  9 Geonosian     1        1     183          1    80.0
#> 10 Gungan        3        3     209          2    74.0
#> # ... with 28 more rows

# (+) Use `h_m` to list the 3 species with the smallest individuals (in terms of mean height):
h_m %>% arrange(mn_height) %>% slice(1:3)
#> # A tibble: 3 x 6
#>   species            n not_NA_h mn_height not_NA_m md_mass
#>   <chr>          <int>    <int>     <dbl>    <int>   <dbl>
#> 1 Yoda's species     1        1      66.0        1    17.0
#> 2 Aleena             1        1      79.0        1    15.0
#> 3 Ewok               1        1      88.0        1    20.0

# (+) Use `h_m` to list the 3 species with the heaviest individuals (in terms of median mass):
h_m %>% arrange(desc(md_mass)) %>%  slice(1:3)
#> # A tibble: 3 x 6
#>   species     n not_NA_h mn_height not_NA_m md_mass
#>   <chr>   <int>    <int>     <dbl>    <int>   <dbl>
#> 1 Hutt        1        1       175        1    1358
#> 2 Kaleesh     1        1       216        1     159
#> 3 Wookiee     2        2       231        2     124

pt <- pt + 12  # increment point total

Task 7: Hot and wet flights

Use the dataset nycflights13::weather for questions that require filter, arrange, select, group_by, summarise (counts, NAs, means, medians), etc.

7a. Save the tibble nycflights13::weather as wt and report its dimensions. (1 point)

7b. Missing values and known unknowns:

  • How many missing (NA) values does wt contain? (1 point)

  • What is the percentage of missing (NA) values in wt? (1 point)

  • What is the range (i.e., minimum and maximum value) of the year variable? (1 point)

7c. How many observations (rows) does the data contain for each of the 3 airports (origin)? (1 point)

7d. Compute a new variable temp_dc that provides the temperature (in degrees Celsius) that corresponds to temp (in degrees Fahrenheit).

Hint: The formula for conversion from Fahrenheit (degrees F) to Celsius (degrees C) is: \(C = (F - 32) \times 5/9\).

Add your new temp_dc variable to a new dataset wt_2 and re-arrange the columns so that temp_dc appears next to temp. (2 points)

7e. When only considering “JFK” airport:

  • What are the 3 (different) dates with the (a) coldest and (b) hottest temperatures at this airport?

Report the 3 dates and their extreme temperatures (in degrees Celsius) for (a) and (b). (2 points)

7f. Plot the amount of mean precipitation by month for each of the 3 airports (origin). (2 points)

Hint: First use dplyr to compute a table of means (by origin and month). Then use ggplot to draw a line or bar plot of the means.

7g. For each of the 3 airports:

  • When excluding extreme cases of precipitation (specifically, values of precip greater than 0.30):
    Does it rain more during winter months (Oct to Mar) or during summer months (Apr to Sep)? (2 points)

  • Plot the total amount of precipitation in winter vs. summer for each airport. (2 points)

Hints: Use filter to remove cases of extreme precipitation and create a logical variable (e.g., summer) that is TRUE for summer months and FALSE for winter months. Then use dplyr to summarise the total amount (sum) of precipitation by origin and summer. The resulting tibble can be plotted as a bar chart (with different facets by origin).
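A sketch of the steps described in these hints (a worked solution for 7g is not shown below; the variable name summer and all plot choices are suggestions):

```r
library(tidyverse)
library(nycflights13)

# Total precipitation (excluding extreme cases) by airport and season:
precip_by_season <- weather %>%
  filter(precip <= 0.30) %>%           # exclude extreme precipitation
  mutate(summer = month %in% 4:9) %>%  # TRUE for Apr-Sep, FALSE for Oct-Mar
  group_by(origin, summer) %>%
  summarise(total_precip = sum(precip))

ggplot(precip_by_season, aes(x = summer, y = total_precip, fill = summer)) +
  geom_bar(stat = "identity") +
  facet_wrap(~origin) +
  labs(title = "Total precipitation in summer vs. winter by airport",
       x = "Summer month (Apr to Sep)", y = "Total precipitation",
       caption = "Data from nycflights13::weather.") +
  theme_bw()
```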

library(nycflights13)

# nycflights13::weather
# ?weather

## How many observations (rows) and variables (columns) does the data set contain overall?
wt <- nycflights13::weather
dim(wt)
#> [1] 26130    15

## (b) missing values and ranges: 
## - How many missing values does the `weather` data contain?
sum(is.na(wt))   # sum of NA values
#> [1] 3157
mean(is.na(wt))  # proportion of NA values (~0.8%)
#> [1] 0.008054599

## - What is the range of values of the `year` variable?
range(wt$year)
#> [1] 2013 2013

## (c) How many observations (rows) does the data contain for each of the 3 airports (`origin`)?
wt %>%
  group_by(origin) %>%
  count()
#> # A tibble: 3 x 2
#> # Groups: origin [3]
#>   origin     n
#>   <chr>  <int>
#> 1 EWR     8708
#> 2 JFK     8711
#> 3 LGA     8711

## (d) Conversion from Fahrenheit to Celsius: 
## Compute a variable `temp_dc` that provides the temperature (in degrees Celsius) 
## that corresponds to `temp` (in degrees Fahrenheit).
## Fahrenheit (degrees F) to Celsius (degrees C) conversion:
## C = (F - 32) x 5/9.

## Add your new `temp_dc` variable to a new dataset `wt_2` and 
## re-arrange its columns so that your `temp_dc` variable appears next to `temp`.

wt_2 <- wt %>%
  mutate(temp_dc = (temp - 32) * 5/9) %>%
  select(origin:temp, temp_dc, everything())
wt_2
#> # A tibble: 26,130 x 16
#>    origin  year month   day  hour  temp temp_dc  dewp humid wind_dir
#>    <chr>  <dbl> <dbl> <int> <int> <dbl>   <dbl> <dbl> <dbl>    <dbl>
#>  1 EWR     2013  1.00     1     0  37.0    2.80  21.9  54.0      230
#>  2 EWR     2013  1.00     1     1  37.0    2.80  21.9  54.0      230
#>  3 EWR     2013  1.00     1     2  37.9    3.30  21.9  52.1      230
#>  4 EWR     2013  1.00     1     3  37.9    3.30  23.0  54.5      230
#>  5 EWR     2013  1.00     1     4  37.9    3.30  24.1  57.0      240
#>  6 EWR     2013  1.00     1     6  39.0    3.90  26.1  59.4      270
#>  7 EWR     2013  1.00     1     7  39.0    3.90  27.0  61.6      250
#>  8 EWR     2013  1.00     1     8  39.0    3.90  28.0  64.4      240
#>  9 EWR     2013  1.00     1     9  39.9    4.40  28.0  62.2      250
#> 10 EWR     2013  1.00     1    10  39.0    3.90  28.0  64.4      260
#> # ... with 26,120 more rows, and 6 more variables: wind_speed <dbl>,
#> #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>, time_hour
#> #   <dttm>

## (e) Only considering "JFK" airport: 
## What are the 3 (different) dates with the (a) coldest and (b) hottest temperatures there?
## Report the 3 dates and their extreme temperatures (in degrees Celsius) for (a) and (b). 
JFK_temp <- wt_2 %>%
  filter(origin == "JFK") %>%
  arrange(temp_dc)

JFK_temp # => coldest days
#> # A tibble: 8,711 x 16
#>    origin  year month   day  hour  temp temp_dc   dewp humid wind_dir
#>    <chr>  <dbl> <dbl> <int> <int> <dbl>   <dbl>  <dbl> <dbl>    <dbl>
#>  1 JFK     2013  1.00    23     9  12.0   -11.1 - 7.06  41.3    280  
#>  2 JFK     2013  1.00    23    10  12.0   -11.1 - 5.08  45.4    280  
#>  3 JFK     2013  1.00    23    11  12.0   -11.1 - 5.08  45.4    290  
#>  4 JFK     2013  1.00    23     6  12.9   -10.6 - 9.04  36.0    290  
#>  5 JFK     2013  1.00    23     7  12.9   -10.6 - 7.96  38.0    290  
#>  6 JFK     2013  1.00    23     8  12.9   -10.6 - 7.06  39.7    290  
#>  7 JFK     2013  1.00    23    12  12.9   -10.6 - 4.00  46.0    280  
#>  8 JFK     2013  1.00    24     6  12.9   -10.6 - 2.02  50.5     40.0
#>  9 JFK     2013  5.00     9     2  13.1   -10.5  12.0   95.3     80.0
#> 10 JFK     2013  1.00    23     4  14.0   -10.0 - 9.94  32.9    300  
#> # ... with 8,701 more rows, and 6 more variables: wind_speed <dbl>,
#> #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>, time_hour
#> #   <dttm>
JFK_temp %>% arrange(desc(temp))  # => hottest first: the 3 hottest dates are 2013-07-18, 2013-07-16, and 2013-07-15
#> # A tibble: 8,711 x 16
#>    origin  year month   day  hour  temp temp_dc  dewp humid wind_dir
#>    <chr>  <dbl> <dbl> <int> <int> <dbl>   <dbl> <dbl> <dbl>    <dbl>
#>  1 JFK     2013  7.00    18    16  98.1    36.7  66.9  36.4    260  
#>  2 JFK     2013  7.00    18    15  97.0    36.1  68.0  39.0    330  
#>  3 JFK     2013  7.00    18    18  97.0    36.1  71.1  43.4    220  
#>  4 JFK     2013  7.00    16    18  96.1    35.6  62.1  32.6    360  
#>  5 JFK     2013  7.00    18    14  96.1    35.6  66.9  38.7    350  
#>  6 JFK     2013  7.00    18    17  96.1    35.6  71.1  44.6    250  
#>  7 JFK     2013  7.00    15    20  95.0    35.0  68.0  41.5    360  
#>  8 JFK     2013  7.00    16    17  95.0    35.0  60.1  31.4     20.0
#>  9 JFK     2013  7.00    17    15  95.0    35.0  64.9  37.3    350  
#> 10 JFK     2013  7.00    20    20  95.0    35.0  69.1  43.1    260  
#> # ... with 8,701 more rows, and 6 more variables: wind_speed <dbl>,
#> #   wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>, time_hour
#> #   <dttm>


## Aggregation examples: -----

## (x) Average temperature per month: 
##     (used in class): 

mn_temp_month <- wt_2 %>%
  # group_by(origin, month) %>%
  group_by(month) %>%
  summarise(n = n(),
            n_not_NA = sum(!is.na(temp_dc)), 
            mn_temp_dc = mean(temp_dc, na.rm = TRUE))
mn_temp_month
#> # A tibble: 12 x 4
#>    month     n n_not_NA mn_temp_dc
#>    <dbl> <int>    <int>      <dbl>
#>  1  1.00  2229     2229       2.02
#>  2  2.00  2010     2010       1.20
#>  3  3.00  2227     2227       4.34
#>  4  4.00  2159     2159      10.9 
#>  5  5.00  2232     2232      16.4 
#>  6  6.00  2160     2160      22.3 
#>  7  7.00  2228     2228      26.7 
#>  8  8.00  2217     2216      23.6 
#>  9  9.00  2159     2159      19.7 
#> 10 10.0   2212     2212      15.6 
#> 11 11.0   2138     2138       7.28
#> 12 12.0   2159     2159       3.54

ggplot(mn_temp_month, aes(x = month, y = mn_temp_dc)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = 1:12) +
  theme_bw()


## (f) Plot the amount of mean precipitation (by origin and month):

wt %>%
  group_by(origin, month) %>%
  summarise(n = n(),
            n_not_NA = sum(!is.na(precip)), 
            mn_precip = mean(precip, na.rm = TRUE)) %>%
  ggplot(aes(x = month, y = mn_precip, color = origin, shape = origin)) +
  geom_point(size = 2) +
  geom_line(size = 1) +
  # geom_bar(aes(fill = origin), stat = "identity", position = "dodge") +
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Mean precipitation by month and origin",
       x = "Month", y = "Mean precipitation", 
       caption = "[Data from nycflights13::weather]") + 
  theme_bw()


## Computation and visualization to answer a question: ----

## (g) For each of the 3 airports: 
## When excluding extreme cases of precipitation (values of `precip` greater than 0.30): 
## Does it rain more during winter (Oct to Mar) or during summer (Apr to Sep) months?
## Plot the total amount of precipitation in winter vs. summer for each airport. 

## Inspect data:
wt # month is given numerically!
#> # A tibble: 26,130 x 15
#>    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
#>    <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
#>  1 EWR     2013  1.00     1     0  37.0  21.9  54.0      230      10.4 
#>  2 EWR     2013  1.00     1     1  37.0  21.9  54.0      230      13.8 
#>  3 EWR     2013  1.00     1     2  37.9  21.9  52.1      230      12.7 
#>  4 EWR     2013  1.00     1     3  37.9  23.0  54.5      230      13.8 
#>  5 EWR     2013  1.00     1     4  37.9  24.1  57.0      240      15.0 
#>  6 EWR     2013  1.00     1     6  39.0  26.1  59.4      270      10.4 
#>  7 EWR     2013  1.00     1     7  39.0  27.0  61.6      250       8.06
#>  8 EWR     2013  1.00     1     8  39.0  28.0  64.4      240      11.5 
#>  9 EWR     2013  1.00     1     9  39.9  28.0  62.2      250      12.7 
#> 10 EWR     2013  1.00     1    10  39.0  28.0  64.4      260      12.7 
#> # ... with 26,120 more rows, and 5 more variables: wind_gust <dbl>, precip
#> #   <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

# ggplot(weather, aes(x = precip)) +
#  geom_histogram(binwidth = 0.01, fill = seeblau)

## Preparation: Filter out extreme values and add a summer variable: 
weather_season <- wt %>%
  filter(precip <= .30) %>%
  mutate(summer = (month > 3 & month < 10)#,
         # winter = (month < 4 | month > 9)
         )

## Computation: Sum of precipitation by origin and summer season:
sum_rain_season <- weather_season %>%
  group_by(origin, summer) %>%
  summarise(n = n(),
            # n_not_NA = sum(!is.na(precip)), 
            # mn_precip = mean(precip, na.rm = TRUE),
            sum_precip = sum(precip, na.rm = TRUE))
sum_rain_season
#> # A tibble: 6 x 4
#> # Groups: origin [?]
#>   origin summer     n sum_precip
#>   <chr>  <lgl>  <int>      <dbl>
#> 1 EWR    F       4320      10.9 
#> 2 EWR    T       4380      11.9 
#> 3 JFK    F       4323      10.3 
#> 4 JFK    T       4379      10.3 
#> 5 LGA    F       4323      10.0 
#> 6 LGA    T       4384       9.21

## Visualization: Sum of precipitation as a bar chart:
ggplot(weather_season, aes(x = summer, fill = summer)) +
  facet_wrap(~origin) + 
  geom_bar(aes(weight = precip), na.rm = TRUE) +
  scale_fill_manual(name = "Summer:", values = c("steelblue3", "firebrick")) +
  labs(title = "Sum of precipitation in winter vs. summer months",
       x = "Summer", y = "Sum of precipitation", 
       caption = "[Data from nycflights13::weather]") + 
  theme_bw()


pt <- pt + 15  # increment point total

Task 8: Not all outliers are alike

This task examines the statistical definition of outliers and uses a generated dataset (entitled out.csv and available at http://rpository.com/ds4psy/mt/out.csv). Use the following read_csv() command to obtain and load it into R:

## Load data (as comma-separated file): 
data <- read_csv("http://rpository.com/ds4psy/mt/out.csv")  # from online source

## Alternatively (from local source): 
# data <- read_csv("out.csv")  # from current directory

An outlier can be defined as an individual whose value in some metric deviates by more than a given criterion (e.g., 2 standard deviations) from the mean. But this definition is incomplete unless it also specifies an appropriate reference group. This task explores the implications of different reference groups.
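This criterion can be sketched in a few lines of base R (using hypothetical values and 2 SDs as the cutoff, with the overall sample as the reference group):

```r
# Hypothetical values; the last one is far from the rest:
x <- c(10, 12, 11, 13, 12, 30)

# Mark values deviating by more than 2 SDs from the (overall) mean:
is_out <- abs(x - mean(x)) > 2 * sd(x)
is_out  # => FALSE FALSE FALSE FALSE FALSE TRUE
```

Note that changing the reference group (e.g., computing `mean()` and `sd()` within subgroups) can change which values count as outliers — which is exactly what this task explores.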

8a. Save the data into a tibble data and report its number of observations and variables. (1 point)

8b. How many missing data values are there in data? (1 point)

8c. What is the gender (or sex) distribution in this sample? (1 point)

8d. Create a plot that shows the distribution of height values for each gender. (1 point)

8e. Compute 2 new variables that signal and distinguish between 2 types of outliers in terms of height:

  1. outliers relative to the height of the overall sample (i.e., individuals with height values deviating more than 2 SD from the overall mean of height) (1 point);

  2. outliers relative to the height of some subgroup’s mean and SD. Here, a suitable subgroup to consider is every person’s gender (i.e., individuals with height values deviating more than 2 SD from the mean height of their own gender). (1 point)

Hints: As both variables signal whether or not someone is an outlier, they should be defined as logicals (being either TRUE or FALSE) and added as new columns to data (via appropriate mutate commands). While the 1st variable can be computed from the mean and SD of the overall sample, the 2nd variable can be computed after grouping data by gender and then using the corresponding group means and SDs. The absolute difference between 2 numeric values x and y is given by abs(x - y).
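The group-relative part of the hints can also be sketched in base R (with hypothetical heights; ave() repeats each group's mean and SD across that group's rows, so no explicit grouping step is needed):

```r
# Hypothetical heights and genders; 130 cm is extreme within group "f":
h <- c(160, 161, 159, 162, 158, 160, 161, 159, 130, 185, 190, 188)
g <- c(rep("f", 9), rep("m", 3))

mn_g <- ave(h, g, FUN = mean)  # group mean, repeated for each row
sd_g <- ave(h, g, FUN = sd)    # group SD, repeated for each row

out_g <- abs(h - mn_g) > 2 * sd_g  # TRUE only for the 130 cm case
```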

8f. Use the 2 new variables to define and identify 2 subgroups of people:

  1. out_1: Individuals (females and males) with height values that are outliers relative to both the entire sample and the sample of their own gender. How many such individuals are in data? (1 point)

  2. out_2: Individuals (females and males) with height values that are not outliers relative to the entire population, but are outliers relative to their own gender. How many such individuals are in data? (1 point)

8g. Bonus task: Visualize the raw values and distributions of height for both types of outliers (out_1 and out_2) in 2 separate plots and describe the height and sex combination of the individuals shown in each plot. (2 bonus points)

## (a) Load and inspect data:
# data <- read_csv("out.csv") # read in csv-file
# data <- as_tibble(data)   # if data is not already a tibble
dim(data)  # => 1000 observations (rows) x 3 variables (columns)
#> [1] 1000    3

## (b) Missing data points: 
sum(is.na(data))  # => 18 missing values
#> [1] 18

## (c) Gender distribution: 
data %>% 
  group_by(sex) %>% 
  count()
#> # A tibble: 2 x 2
#> # Groups:   sex [2]
#>      sex     n
#>    <chr> <int>
#> 1 female   507
#> 2   male   493
# => 50.7% females, 49.3% males.

## (d) Distributions of `height` as density plot: 
ggplot(data, aes(x = height)) +
  geom_density(fill = "gold", alpha = 2/3) +
  geom_density(aes(fill = sex), alpha = 2/5) +
  labs(title = "Distribution of heights overall and by gender", 
       fill = "Gender") + 
  scale_fill_manual(values = c("firebrick", "steelblue3")) +
  theme_bw()


# Note: To avoid the Warning about removing 18 cases with NA-values, 
#       we could first filter out those cases:
# non_NA_data <- filter(data, !is.na(height))

## Alternative solution as 2 histograms: 
ggplot(data) +
  facet_wrap(~sex) + 
  geom_histogram(aes(x = height, fill = sex), binwidth = 5, color = "grey10") +
  labs(title = "Distribution of heights by gender",
       fill = "Gender") +
  scale_fill_manual(values = c("firebrick", "steelblue3")) +
  theme_bw()


## (+) Included in (e), but also possible to do separately:  
##     Compute the number, means and SD of height values in 2 ways: 
{
  ## 1. overall: 
  data %>%
    summarise(n = n(),
              n_not_NA = sum(!is.na(height)),
              mn_height = mean(height, na.rm = TRUE),
              sd_height = sd(height, na.rm = TRUE))
  
  ## 2. by gender:
  data %>%
    group_by(sex) %>%
    summarise(n = n(),
              n_not_NA = sum(!is.na(height)),
              mn_height = mean(height, na.rm = TRUE),
              sd_height = sd(height, na.rm = TRUE))
}
#> # A tibble: 2 x 5
#>      sex     n n_not_NA mn_height sd_height
#>    <chr> <int>    <int>     <dbl>     <dbl>
#> 1 female   507      501  169.0679  8.193619
#> 2   male   493      481  180.5676 11.026677


## (e) Detecting and marking outliers (by logical variables): ----- 

## Compute the means, SDs, and corresponding outliers in 2 ways:

crit <- 2  # criterion value for detecting outliers (in SD units)

data_out <- data %>%      
  # 1. Compute means, SD, and outliers for overall sample: 
  mutate(mn_height  = mean(height, na.rm = TRUE),  
         sd_height  = sd(height, na.rm = TRUE),
         out_height = abs(height - mn_height) > (crit * sd_height)) %>%
  group_by(sex) %>%       
  # 2. Compute same metrics for subgroups (by sex):
  mutate(mn_sex_height  = mean(height, na.rm = TRUE), 
         sd_sex_height  = sd(height, na.rm = TRUE),
         out_sex_height = abs(height - mn_sex_height) > (crit * sd_sex_height))

# data_out

## (f) Identify 2 types of outliers: ----- 

## 1. Outliers relative to the entire population AND to their own gender: 
out_1 <- data_out %>%
  filter(out_height & out_sex_height) %>%
  arrange(sex, height)

nrow(out_1) # => 21 individuals. 
#> [1] 21

## 2. Outliers relative to their own gender, but NOT relative to the entire population:
out_2 <- data_out %>%
  filter(!out_height & out_sex_height) %>%
  arrange(sex, height)  

nrow(out_2) # => 24 individuals.
#> [1] 24


## (g) Bonus task:  
## Visualization and interpretation of both types of outliers: ----- 

## 1. Showing out_1: 
ggplot(out_1, aes(x = sex, y = height)) +
  geom_violin(aes(fill = sex)) + 
  geom_jitter(size = 4, alpha = 2/3) + 
  scale_fill_manual(values = c("firebrick", "steelblue3")) +
  labs(title = "Outliers relative to both overall sample and gender", 
       x = "Gender", y = "Height (in cm)", 
       fill = "Gender:") +
  theme_bw()


# Interpretation: 
# `out_1` contains mostly short women (except for 1 tall woman) 
#  and mostly tall men (except for 2 short men). 

## 2. Showing out_2: 
ggplot(out_2, aes(x = sex, y = height)) +
  geom_violin(aes(fill = sex)) + 
  geom_jitter(size = 4, alpha = 2/3) + 
  scale_fill_manual(values = c("firebrick", "steelblue3")) +
  labs(title = "Outliers relative to gender but not overall sample", 
       x = "Gender", y = "Height (in cm)", 
       fill = "Gender:") +
  theme_bw()


# Interpretation: 
# `out_2` contains individuals who are either tall women or short men.

pt <- pt + (8 + 0)  # increment point total (leaving out 2 bonus points)

Task 9: The length of flights

Use the nycflights13::flights dataset to compute the actual duration of flights:

  • Compute a variable true_duration as the duration of each flight (in minutes) from its dep_time and arr_time.

  • How does it relate to the air_time variable in the data set? (Plot the relationship between both variables.)

library(nycflights13) # loads flights dataset

compute_duration <- function(dep_min, arr_min) {

  dur <- NA # initialize
  
  # Distinguish between 2 cases:
  if (dep_min < arr_min) {  # dep before arr (i.e., same day): 
    dur <- arr_min - dep_min
  } else { # dep later than arr (i.e., different days): 
    dur <- (arr_min + 24 * 60) - dep_min 
  }
  
  return(dur)
  
}

# Check for vectors: 
dep <- seq(0, 24*60, by = +15)
arr <- seq(24*60, 0, by = -15)

# compute_duration(dep, arr)
# ???: How to apply a function to each pair of values of 2 vectors/columns?
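One answer to the question in the comment above: ifelse() evaluates its condition elementwise, so the same logic works on whole vectors (alternatively, mapply(compute_duration, dep, arr) applies the scalar function to each pair of values). A sketch:

```r
# Vectorized variant of compute_duration(): ifelse() works elementwise.
compute_duration_vec <- function(dep_min, arr_min) {
  ifelse(dep_min < arr_min,
         arr_min - dep_min,               # dep before arr: same day
         (arr_min + 24 * 60) - dep_min)   # arr on the next day: add 24 hours
}

compute_duration_vec(c(600, 1380), c(720, 60))  # => 120 120
```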

d <- flights %>% 
  filter(origin == "JFK") %>%  # to reduce size of dataset
  select(dep_time, arr_time, air_time) %>%
  mutate(dep_time_min = ((dep_time %/% 100) * 60) + (dep_time %% 100),
         arr_time_min = ((arr_time %/% 100) * 60) + (arr_time %% 100),
         # true_duration = compute_duration(dep_time_min, arr_time_min),
         true_duration = arr_time_min - dep_time_min) %>%
  filter(air_time > 0, true_duration > 0)

p <- ggplot(d, aes(x = air_time, y = true_duration)) +
  geom_point(alpha = 1/4) +
  geom_abline(intercept = 0, slope = 1, linetype = 2, size = 1, color = "red3") +
  theme_bw()
p

Notes

  • The dep_time and arr_time are specified in terms of hour and minutes. As an hour contains 60 minutes (rather than 100), we first use modular arithmetic to transform both variables into a metric of minutes (see variables dep_time_min and arr_time_min).

  • For flights departing and arriving on the same day, we can simply subtract dep_time_min from arr_time_min to obtain true_duration. However, if a flight departs before and arrives after midnight (i.e., arrives on the next day), this measure would yield a negative result. To correct for this, we need to add 24 hours (24 times 60 minutes) to arr_time whenever arr_time is before (or smaller than) dep_time. [This assumes that there are no flights exceeding 24 hours.]

  • This subtask seems too difficult at this stage, as it requires conditional execution (for flights departing and arriving on the same vs. different days) and/or writing functions.

In chapter 16 (Dates and times: http://r4ds.had.co.nz/dates-and-times.html), this problem is solved by introducing a Boolean variable for overnight flights and re-computing air_time by subtracting date-time objects (as intervals).
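The idea behind that solution can also be sketched without date-time objects: flag overnight flights with a logical variable and exploit the fact that TRUE coerces to 1 in arithmetic (hypothetical minute values below, not the actual flights columns):

```r
# Hypothetical dep/arr times, already converted to minutes since midnight:
dep_min <- c(480, 1290)          # 08:00, 21:30
arr_min <- c(570,   15)          # 09:30, 00:15 (on the next day)

overnight <- arr_min < dep_min   # logical flag for flights crossing midnight
arr_corrected <- arr_min + overnight * 24 * 60  # TRUE coerces to 1

arr_corrected - dep_min          # => 90 165
```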

7: Exploratory Data Analysis

Key skills

Transform and visualize datasets to:

  • detect and deal with missing (NA) values;
  • view and interpret distributions of variables (e.g., to see patterns and deal with unusual or extreme values);
  • view and interpret relationships between (categorical/continuous) variables.

Example tasks

Additional examples of tasks are contained in the other chapters (see tasks on “multiple chapters” below).

Task 10: Cars

Using the mtcars dataset:

11. The mtcars data (contained in R datasets) contains …

## Data to use:
# ?datasets::mtcars
# mtcars

## (0) Convert into a tibble: 
df <- as_tibble(rownames_to_column(mtcars, var = "model"))
df$cyl <- factor(df$cyl)
df$am <- factor(df$am)
df
#> # A tibble: 32 x 12
#>    model   mpg cyl    disp    hp  drat    wt  qsec    vs am     gear  carb
#>    <chr> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
#>  1 Mazd…  21.0 6       160 110    3.90  2.62  16.5  0    1      4.00  4.00
#>  2 Mazd…  21.0 6       160 110    3.90  2.88  17.0  0    1      4.00  4.00
#>  3 Dats…  22.8 4       108  93.0  3.85  2.32  18.6  1.00 1      4.00  1.00
#>  4 Horn…  21.4 6       258 110    3.08  3.22  19.4  1.00 0      3.00  1.00
#>  5 Horn…  18.7 8       360 175    3.15  3.44  17.0  0    0      3.00  2.00
#>  6 Vali…  18.1 6       225 105    2.76  3.46  20.2  1.00 0      3.00  1.00
#>  7 Dust…  14.3 8       360 245    3.21  3.57  15.8  0    0      3.00  4.00
#>  8 Merc…  24.4 4       147  62.0  3.69  3.19  20.0  1.00 0      4.00  2.00
#>  9 Merc…  22.8 4       141  95.0  3.92  3.15  22.9  1.00 0      4.00  2.00
#> 10 Merc…  19.2 6       168 123    3.92  3.44  18.3  1.00 0      4.00  4.00
#> # ... with 22 more rows

## (1) Distribution of mpg: ---- 
range(df$mpg)
#> [1] 10.4 33.9

ggplot(df, aes(x = mpg)) +
  geom_histogram(binwidth = 5) 


## (2) Group means and boxplot: Mean mpg by cylinder: ----  

df %>%
  group_by(cyl) %>%
  summarise(n = n(),
            md_mpg = median(mpg),
            mn_mpg = mean(mpg)
            )
#> # A tibble: 3 x 4
#>   cyl       n md_mpg mn_mpg
#>   <fct> <int>  <dbl>  <dbl>
#> 1 4        11   26.0   26.7
#> 2 6         7   19.7   19.7
#> 3 8        14   15.2   15.1

ggplot(df, aes(x = cyl, y = mpg, color = cyl)) +
  geom_boxplot() +
  # geom_violin() + 
  geom_jitter()


## (3) Scatterplot: ---- 

## Show value of `disp` (on y-axis) by `hp` (on x-axis), grouped by `am`: 
ggplot(df, aes(x = hp, y = disp, fill = am)) +
  geom_point(shape = 21, size = 2.5) +
  # geom_smooth() +
  geom_text(aes(label = model), size = 2.5, vjust = 0, hjust = 0, nudge_x = 5) +
  theme_bw()


## Alternative (using facets):
ggplot(df, aes(x = hp, y = disp)) +
  facet_wrap(~am) + 
  geom_point(size = 2.5) +
  # geom_smooth() +
  geom_text(aes(label = model), size = 2.5, vjust = 0, hjust = 0, nudge_x = 5) +
  theme_bw()


## Conclusions: ---- 
# - There is a positive correlation between horsepower and displacement 
#   (for both types of transmission). 
# - Outliers: The Ford Pantera L and the Maserati Bora 
#             have more hp and disp than other cars with manual transmission.

Tasks involved

  • Plot variable distributions (histogram);
  • Compute descriptive group measures (e.g., counts, means, SDs);
  • Plot raw data and group means (boxplot);
  • Plot a scatterplot of 2 continuous variables;
  • Interpret tables and graphs and draw conclusions.

10: Tibbles

Key skills

Turn data into tibbles:

  • convert data frames into tibbles (by as_tibble);
  • create tibbles from tabular data (by tibble and tribble).
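A minimal sketch of both constructors (assuming the tibble package is available; column names x and y are arbitrary):

```r
library(tibble)

t1 <- tibble(x = 1:3, y = c("a", "b", "c"))  # column-wise construction

t2 <- tribble(                               # row-wise ("transposed tibble")
  ~x, ~y,
   1, "a",
   2, "b",
   3, "c"
)
```

tribble() is convenient for entering small tables by hand, as in Task 12 below.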

Example tasks

The commands to create and convert to tibbles are included in the tasks of other chapters.

12: Tidy Data

Key skills

  • Identify and create “tidy” datasets.
  • Wrangle data to:
    • gather (wider) datasets into longer ones;
    • spread (longer) datasets into wider ones;
    • separate and unite variables (columns).

Example tasks

Task 12: Taking stocks

The following table shows the start and end price of 3 stocks on 3 days (d1, d2, d3):

Stock data example showing the start and end prices of the shares of 3 companies on 3 days.
stock  d1_start  d1_end  d2_start  d2_end  d3_start  d3_end
Amada       2.5     3.6       3.5     4.2       4.4     2.8
Betix       3.3     2.9       3.0     2.1       2.3     2.5
Cevis       4.2     4.8       4.6     3.1       3.2     3.7

12a. Create a tibble st that contains this data in this (wide) format. (1 point)

12b. Transform st into a longer table st_long that contains 18 rows and only 1 numeric variable for all stock prices. Adjust this table so that the day and time appear as 2 separate columns. (2 points)

12c. Create a graph that shows the 3 stocks’ end prices (on the y-axis) over the 3 days (on the x-axis). (1 point)

12d. Spread st_long into a wider table that contains start and end prices as 2 distinct variables (columns) for each stock and day. (1 point)

# library(tidyverse)

## (a) Enter stock data (in wide format) as a tibble:
st <- tribble(
  ~stock, ~d1_start, ~d1_end, ~d2_start, ~d2_end, ~d3_start, ~d3_end,  
  #-----|----------|--------|----------|--------|----------|--------|
  "Amada",   2.5,     3.6,    3.5,       4.2,      4.4,       2.8,            
  "Betix",   3.3,     2.9,    3.0,       2.1,      2.3,       2.5,  
  "Cevis",   4.2,     4.8,    4.6,       3.1,      3.2,       3.7     
)
dim(st)
#> [1] 3 7

## Note data structure: 
## 2 nested factors: day (1 to 3), type (start or end).

## (b) Change from wide to long format 
##     that contains the day (d1, d2, d3) and type (start vs. end) as separate columns:
st_long <- st %>%
  gather(d1_start:d3_end, key = "key", value = "val") %>%
  separate(key, into = c("day", "time")) %>%
  arrange(stock, day, time) # optional: arrange rows
st_long
#> # A tibble: 18 x 4
#>    stock   day  time   val
#>    <chr> <chr> <chr> <dbl>
#>  1 Amada    d1   end   3.6
#>  2 Amada    d1 start   2.5
#>  3 Amada    d2   end   4.2
#>  4 Amada    d2 start   3.5
#>  5 Amada    d3   end   2.8
#>  6 Amada    d3 start   4.4
#>  7 Betix    d1   end   2.9
#>  8 Betix    d1 start   3.3
#>  9 Betix    d2   end   2.1
#> 10 Betix    d2 start   3.0
#> 11 Betix    d3   end   2.5
#> 12 Betix    d3 start   2.3
#> 13 Cevis    d1   end   4.8
#> 14 Cevis    d1 start   4.2
#> 15 Cevis    d2   end   3.1
#> 16 Cevis    d2 start   4.6
#> 17 Cevis    d3   end   3.7
#> 18 Cevis    d3 start   3.2

## (c) Plot the end values (on the y-axis) of the 3 stocks over 3 days (x-axis):
st_long %>% 
  filter(time == "end") %>%
  ggplot(aes(x = day, y = val, color = stock, shape = stock)) +
  geom_point(size = 4) + 
  geom_line(aes(group = stock)) +
  ## Pimping plot: 
  labs(title = "End prices of stocks", 
       x = "Day", y = "End price", 
       shape = "Stock:", color = "Stock:") +
  theme_bw()

## (d) Change st_long into a wider format that lists start and end as 2 distinct variables (columns):
st_long %>%
  spread(key = time, value = val) %>%
  mutate(day_nr = parse_integer(str_sub(day, 2, 2))) # optional: get day_nr as integer variable
#> # A tibble: 9 x 5
#>   stock   day   end start day_nr
#>   <chr> <chr> <dbl> <dbl>  <int>
#> 1 Amada    d1   3.6   2.5      1
#> 2 Amada    d2   4.2   3.5      2
#> 3 Amada    d3   2.8   4.4      3
#> 4 Betix    d1   2.9   3.3      1
#> 5 Betix    d2   2.1   3.0      2
#> 6 Betix    d3   2.5   2.3      3
#> 7 Cevis    d1   4.8   4.2      1
#> 8 Cevis    d2   3.1   4.6      2
#> 9 Cevis    d3   3.7   3.2      3
 
pt <- pt + 5  # increment point total

Multiple chapters

Task 13: Flower power

The iris data (contained in R datasets) provides the measurements (in cm) of plant parts (length and width of sepal and petal parts) for 50 flowers from each of 3 species of iris (called setosa, versicolor, and virginica).

13a. Save datasets::iris as a tibble ir that contains this data and inspect it. Are there any missing values? (2 points)

13b. Compute a summary table that shows the means of the 4 measurement columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) for each of the 3 Species (in rows). Save the resulting table of means as a tibble im1. (2 points)

13c. Create a histogram that shows the distribution of Sepal.Width values across all species. (1 point)

13d. Create a plot that shows the shape of the distribution of Sepal.Width values for each species. (1 point)

13e. Create a plot that shows Petal.Width as a function of Sepal.Width separately (i.e., in 3 facets) for each species. (1 point)

# ?iris

## (a) Turn into tibble and inspect: ----- 
ir <- as_tibble(datasets::iris)
dim(ir)        # => 150 observations (rows) x 5 variables (columns)
#> [1] 150   5
sum(is.na(ir)) # =>   0 missing values
#> [1] 0

## (b) Compute counts and means by species: ----- 
im1 <- ir %>%
  group_by(Species) %>%
  summarise(n = n(),
            mn_sep.len = mean(Sepal.Length),
            mn_sep.wid = mean(Sepal.Width),
            mn_pet.len = mean(Petal.Length),
            mn_pet.wid = mean(Petal.Width)
            )

# Print im1 (as table with 4 variables of means): 
knitr::kable(im1, caption = "Average iris measures (4 variables of mean values).") 
Average iris measures (4 variables of mean values).
Species      n  mn_sep.len  mn_sep.wid  mn_pet.len  mn_pet.wid
setosa      50       5.006       3.428       1.462       0.246
versicolor  50       5.936       2.770       4.260       1.326
virginica   50       6.588       2.974       5.552       2.026

## Graphical exploration: ----- 

## Distribution of 1 (continuous) variable:

## (c) Distribution of Sepal.Width across species:
ggplot(ir, aes(x = Sepal.Width)) +
  geom_histogram(binwidth = .1, fill = "forestgreen") + 
  labs(title = "Distribution of sepal width across iris species") +
  theme_bw()


## Distributions/relationships between 2 variables (1 categorical, 1 continuous):

## (d) The distributions of Sepal.Width by species:
ggplot(ir, aes(x = Species, y = Sepal.Width, fill = Species, shape = Species)) +
  geom_violin() +
  geom_point(size = 4, alpha = 1/2, position = "jitter") +
  labs(title = "Distributions of sepal width by iris species") +
  theme_bw()


## Alternative solution using density plots: 
ggplot(ir, aes(x = Sepal.Width, fill = Species)) +
  facet_wrap(~Species) + 
  geom_density() +
  labs(title = "Distributions of sepal width by iris species") + 
  coord_fixed() +
  theme_bw()


## Relationships between 2 variables (2 continuous):

## (e) Petal.Width as a function of Sepal.Width by iris species: 
ggplot(ir, aes(x = Sepal.Width, y = Petal.Width, color = Species, shape = Species)) +
  facet_wrap(~Species) +
  geom_jitter(size = 3, alpha = 2/3) +
  # geom_density2d() +
  coord_fixed() +
  labs(title = "Petal width as a function of sepal width by iris species") +
  theme_bw()


pt <- pt + 7  # increment point total

## next: Turn iris df into long format using tidyr::gather ----- 
## (see below)

13f. Re-format your tibble ir into a tibble ir_long that is in “long format”. (2 points)

Hint: Use tidyr::gather and separate to turn ir into a tibble that contains only 1 dependent variable for the value of measurements (e.g., val), but 2 categorical variables that specify the part (Sepal vs. Petal) and metric (Length vs. Width) of each observation.

13g. Use ir_long to recompute the subgroup means (for each combination of species, plant part, and metric) computed in 13b. Save the resulting table of means as a tibble im2 and verify that they have not changed from im1 above. (2 points)

13h. Visualize the relationships between the means of im2 (i.e., the mean measurements by plant part and metric) separately for each species. (2 points)

Hints: This task asks for showing the value of a continuous variable (the value of means) as a function of 2 categorical variables (the plant part and type of metric). Possible solutions could incorporate either geom_line or geom_tile and use different facets for different Species.

13i. Bonus task: Re-format your tibble ir_long (in long format) into a wider tibble ir_short that corresponds to the original ir dataset. (2 bonus points)

Hints: This task calls for applying tidyr::spread and unite commands to ir_long. However, spread will encounter an error unless every individual plant is identified by a unique variable (e.g., id number). This can be achieved by adding a numeric counter variable id (with values of rep(1:50, 3)) to ir before creating ir_long.
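The spread() failure mentioned in the hints can be reproduced on a tiny hypothetical table (assuming tidyr is loaded; the column names are made up for illustration):

```r
library(tidyr)

# Two rows share the same key and are otherwise indistinguishable:
long <- data.frame(key = c("a", "a"), val = 1:2)
# spread(long, key, val)  # would fail: duplicate identifiers for rows

# Adding a unique id makes every row identifiable, so spread() succeeds:
long$id <- 1:2
spread(long, key, val)
```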

## Data (from above): 
# ?iris
# ir <- as_tibble(datasets::iris)
# ir

## (f) Re-format ir into long format (using tidyr::gather) ----- 
ir_long <- ir %>% 
  mutate(id = rep(1:50, 3)) %>%   # add a unique id to each plant [to enable (i) below]
  gather(Sepal.Length:Petal.Width, key = "type", value = "val") %>%
  separate("type", into = c("part", "metric"), sep = "\\.")

ir_long
#> # A tibble: 600 x 5
#>    Species    id  part metric   val
#>  *  <fctr> <int> <chr>  <chr> <dbl>
#>  1  setosa     1 Sepal Length   5.1
#>  2  setosa     2 Sepal Length   4.9
#>  3  setosa     3 Sepal Length   4.7
#>  4  setosa     4 Sepal Length   4.6
#>  5  setosa     5 Sepal Length   5.0
#>  6  setosa     6 Sepal Length   5.4
#>  7  setosa     7 Sepal Length   4.6
#>  8  setosa     8 Sepal Length   5.0
#>  9  setosa     9 Sepal Length   4.4
#> 10  setosa    10 Sepal Length   4.9
#> # ... with 590 more rows
dim(ir_long) # => 600 rows x 5 columns
#> [1] 600   5

## (g) Recompute group counts and means from ir_long: ----- 
im2 <- ir_long %>%
  group_by(Species, part, metric) %>%
  summarise(n = n(),
            mn_val = mean(val)
            )

# Print im2 (as table with 4 variables of means): 
knitr::kable(im2, caption = "Average iris measures (1 variable of mean values).")
Average iris measures (1 variable of mean values).
Species     part   metric   n  mn_val
setosa      Petal  Length  50   1.462
setosa      Petal  Width   50   0.246
setosa      Sepal  Length  50   5.006
setosa      Sepal  Width   50   3.428
versicolor  Petal  Length  50   4.260
versicolor  Petal  Width   50   1.326
versicolor  Sepal  Length  50   5.936
versicolor  Sepal  Width   50   2.770
virginica   Petal  Length  50   5.552
virginica   Petal  Width   50   2.026
virginica   Sepal  Length  50   6.588
virginica   Sepal  Width   50   2.974

## Check: Compare sums of all means in im1 vs. im2: 
sum1 <- sum(im1$mn_sep.len) + sum(im1$mn_sep.wid) + 
        sum(im1$mn_pet.len) + sum(im1$mn_pet.wid)  # => 41.574
sum2 <- sum(im2$mn_val)  # => 41.574
sum1 == sum2             # => TRUE (qed)
#> [1] TRUE


## Showing the value of a continuous variable by 2 categorical variables: ----- 

## (h) Visualize the means of im2 (i.e., mean measurements by plant part and metric) 
##     as a line plot separately for each species:
ggplot(im2, aes(x = part, y = mn_val, group = metric, color = metric)) +
  facet_wrap(~Species) +
  geom_point(aes(shape = metric), size = 3) +
  geom_line(aes(linetype = metric)) +
  labs(title = "Mean petal and sepal lengths and widths by iris species") + 
  theme_bw()


## Alternative solution using tile plots:
ggplot(im2, aes(x = part, y = metric)) +
  facet_wrap(~Species) +
  geom_tile(aes(fill = mn_val)) +
  geom_text(aes(label = mn_val), color = "white") + 
  labs(title = "Mean petal and sepal lengths and widths by iris species") + 
  theme_bw()


## Alternative: Using raw values (ir_long) to visualize the distributions 
##              of val as a function of Species, plant part, and metric:
ggplot(ir_long, aes(x = part, y = val, fill = Species)) + 
  facet_wrap(~Species) +
  geom_violin() +
  geom_jitter(aes(shape = metric), size = 2, alpha = 2/3) + 
  labs(title = "Mean petal and sepal lengths and widths by iris species") + 
  theme_bw()


## (i) Bonus task: ----- 

## See (f) above for the additional id-column for each plant:
## mutate(id = rep(1:50, 3)) %>% ...

## Re-format ir_long into shorter format (using tidyr::spread) ----- 
ir_short <- ir_long %>%
  unite(type, part, metric, sep = ".") %>%  # unite part and metric into "type" column
  spread(key = type, value = val) %>%       # spread "type" variable into multiple columns
  arrange(Species, id)

ir_short
#> # A tibble: 150 x 6
#>    Species    id Petal.Length Petal.Width Sepal.Length Sepal.Width
#>     <fctr> <int>        <dbl>       <dbl>        <dbl>       <dbl>
#>  1  setosa     1          1.4         0.2          5.1         3.5
#>  2  setosa     2          1.4         0.2          4.9         3.0
#>  3  setosa     3          1.3         0.2          4.7         3.2
#>  4  setosa     4          1.5         0.2          4.6         3.1
#>  5  setosa     5          1.4         0.2          5.0         3.6
#>  6  setosa     6          1.7         0.4          5.4         3.9
#>  7  setosa     7          1.4         0.3          4.6         3.4
#>  8  setosa     8          1.5         0.2          5.0         3.4
#>  9  setosa     9          1.4         0.2          4.4         2.9
#> 10  setosa    10          1.5         0.1          4.9         3.1
#> # ... with 140 more rows

## Verify identity of all measurement values in ir and ir_short: 
all.equal(ir$Sepal.Length, ir_short$Sepal.Length) & 
all.equal(ir$Sepal.Width,  ir_short$Sepal.Width)  &
all.equal(ir$Petal.Length, ir_short$Petal.Length) & 
all.equal(ir$Petal.Width,  ir_short$Petal.Width)  # => TRUE (qed) 
#> [1] TRUE

pt <- pt + (6 + 0)  # increment point total (leaving out 2 bonus points) 

Task 14: Numeracy vs. intelligence

This task uses a generated dataset (entitled numeracy.csv and available at http://rpository.com/ds4psy/data/numeracy.csv). Use the following read_csv() command to obtain and load it into R:

## Load data (as comma-separated file): 
data <- read_csv("http://rpository.com/ds4psy/data/numeracy.csv")  # from online source

## Alternatively (from local source): 
# data <- read_csv("numeracy.csv")  # from current directory

Data structure

This data is structured as follows:

  • Each row contains the data from one individual participant.

  • Six columns contain the following independent variables (IVs):

  1. name contains each participant’s initials;
  2. gender: each participant’s gender;
  3. bdate: each participant’s birth date;
  4. bweekday: the day of the week on which each participant was born;
  5. height: how tall each participant is (in cm);
  6. blood_type: each participant’s blood type;

The remaining six columns contain the following dependent variables (DVs):

  • bnt_1 to bnt_4 signal a correct (1) or incorrect (0) answer to the corresponding question of the Berlin Numeracy Test (Cokely et al., 2012, see http://www.riskliteracy.org/researchers/ for details). The sum of these 4 variables defines the BNT numeracy score for each participant.

  • g_iq and s_iq provide two distinct measurements of each participant’s general vs. social intelligence.

Tasks

(A) Basics:

14a. Inspect the data dimensions and the percentage of missing values. (1 point)

Hints: As later parts of this task use variables created in earlier ones, you should either update your existing data with modified tibbles (e.g., including new variables) or incrementally save your new tibbles with new names (e.g., data_a, data_b, etc.).

14b. Split the birth date (bdate) variable into 3 separate variables (byear, bmonth, and bday) that denote the year, month, and day of each participant’s birth. (1 point)

Hints: Use tidyr::separate for this task, but check the type of your new variables (and use the convert argument to ensure that they are numeric).
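The effect of the convert argument can be seen with a minimal base R sketch: string splitting always yields character pieces, and an explicit conversion (which convert = TRUE automates) is needed to make them numeric.

```r
# Splitting a date string yields character pieces; explicit conversion
# is needed to make them numeric (this is what convert = TRUE automates):
pieces <- strsplit("1968-12-14", "-")[[1]]
typeof(pieces)              # "character"
nums <- as.integer(pieces)  # 1968 12 14
typeof(nums)                # "integer"
```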

14c. Create a new variable summer_born that is TRUE when a participant was born in summer (defined as April to September) and FALSE when a person was born in winter (October to March). (1 point)

(B) Assessing IVs:

14d. Compute the current age of each participant as the person would report it (i.e., in completed years, taking into account today’s date). (2 points)

14e. List the frequency of each blood type (blood_type) by gender (as a tibble by using dplyr) and re-format your tibble into a wider format (by using tidyr) that lists the types of gender in rows and the types of blood_type as columns. (2 points)
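For orientation, base R's table() produces the same wide layout (gender in rows, blood type in columns) directly; a minimal sketch with invented values:

```r
# Hypothetical mini-data (invented values) to illustrate the target layout:
gender <- c("female", "female", "male", "male", "male")
blood  <- c("A+", "O+", "A+", "A+", "O+")
table(gender, blood)  # rows: gender, columns: blood type
```

The dplyr/tidyr route asked for here yields the same counts as a tibble, which is easier to process further.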

14f. Compute descriptives (the counts, means, and standard deviations) of height (a) by gender and (b) by cohort (i.e., the decade of birth). (3 points)

Hints: Use integer division on each participant’s age to determine his or her decade cohort: cohort = age %/% 10.
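Integer division truncates toward zero, so it maps each age to its decade:

```r
# Integer division maps ages to decade cohorts:
ages <- c(9, 19, 25, 37, 41, 78)
ages %/% 10  # => 0 1 2 3 4 7
```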

14g. Visualize the distributions of height (a) by gender and (b) by cohort. What do you find? (2 points)

Hints: Inspect the type of your cohort variable (by typeof(my_data$cohort)). Depending on your intended plot, you may have to turn this variable into a factor by using as.factor(my_data$cohort). The factor has discrete levels, while the cohort variable may have been continuous (numeric).
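The distinction matters for ggplot2's scales (continuous vs. discrete); a quick check of both types:

```r
# cohort as created by integer division is numeric (continuous):
cohort <- c(3, 4, 4, 5)
typeof(cohort)            # "double"

# as.factor() turns it into a variable with discrete levels:
cohort_f <- as.factor(cohort)
levels(cohort_f)          # "3" "4" "5"
```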

(C) Assessing DVs:

14h. Compute the aggregate BNT score as the sum of all four bnt_i values (i.e., a value varying from 0 to 4) for each participant. How many missing BNT values are there? (1 point)

14i. Inspect the values for the 2 intelligence scores g_iq and s_iq:

  • How many missing values are there?
  • What are their means and their ranges (minimum and maximum values)?
  • Plot their distributions in 2 separate histograms. (3 points)

14j. Visualize the relationship between g_iq and s_iq. Can you detect any systematic trend? (1 point)

Hints: Consider using a scatterplot, but take care of overplotting.

(D) Exploring results:

14k. Does numeracy (as measured by the aggregate BNT score) seem to vary by gender? (1 point)

14l. Assess possible effects of numeracy (as measured by BNT) on the 2 measures of intelligence (g_iq and s_iq). (2 points)

14m. Assess possible effects of the independent variables

  • gender (gender),
  • age (as provided by age or cohort from above),
  • birth season (summer_born),
  • blood type (blood_type), and
  • the day of the week on which a person was born (bweekday),

on each of the 2 types of intelligence (g_iq and s_iq) in 2 ways:

  1. numerically (by computing group means) and
  2. graphically (by plotting means and/or distributions).

Which plausible or implausible effects does the data suggest? (10 points)


## (A) Basics: ----- 

## (a) Inspect data:
data_a <- data 
dim(data_a) # => 1000 participants x 12 variables
#> [1] 1000   12
sum(is.na(data_a))   # 130 missing values
#> [1] 130
mean(is.na(data_a))  # 1.083 percent of missing data
#> [1] 0.01083333
# data_a

## (b) Split bdate into 3 separate variables (`byear`, `bmonth`, and `bday`):
# data_a$bdate       # inspect variable
length(data_a$bdate) # 1000 values (and sum(is.na(data_a$bdate)) shows none are missing)
#> [1] 1000

data_b <- data_a %>%
  tidyr::separate(bdate, into = c("byear", "bmonth", "bday"), sep = "-", 
                  convert = TRUE, 
                  remove = FALSE) %>%
  select(name:bdate, byear, bmonth, bday, everything()) # re-arrange variables

# Note: With convert = FALSE, byear, bmonth, and bday would remain character variables.
# Resulting data:
# data_b

## (c) Create a new variable `summer_born`:
##     `TRUE` iff born in summer (April to September)
data_c <- data_b %>% 
  mutate(summer_born = (bmonth > 3) & (bmonth < 10)) %>%
  select(name:bday, summer_born, everything()) # re-arrange variables
# data_c

## (B) Assessing IVs: ----- 

## (d) Age in (completed) years:

# Today's date: 
year_now  <- 2018
month_now <-    7
day_now   <-   13

data_d <- data_c %>%
  mutate(had_bday_this_year = (bmonth < month_now) | (bmonth == month_now & bday <= day_now),
         no_bday_this_year_yet = !had_bday_this_year, 
         age = (year_now - byear) - no_bday_this_year_yet) %>%
  select(name:bday, age, everything()) # re-arrange variables  
# data_d

# Note: A simpler solution using the lubridate package:
{
  library(lubridate)
  
  bday <- ymd("000713") # today, 18 years ago 
  (bday %--% today())   # yields a time interval
  
  # Define a function that computes current age: 
  cur_age <- function(bday) {
    
    lifetime <- (bday %--% today()) # interval from bday to today() 
    (lifetime %/% years(1))         # integer division (into full years)
    
  }
  
  # Check: 
  bday_1 <- ymd("000712") # year 2000 yesterday
  bday_2 <- ymd("000713") # year 2000 today
  bday_3 <- ymd("000714") # year 2000 tomorrow
  
  cur_age(bday_1) # => 18
  cur_age(bday_2) # => 18
  cur_age(bday_3) # => 17 (qed)
}
#> [1] 17
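For comparison, the same completed-years logic can be sketched in base R (no lubridate), assuming a fixed reference date; the helper name completed_years is hypothetical:

```r
# Hypothetical helper: age in completed years at a fixed reference date
completed_years <- function(bdate, ref = as.Date("2018-07-13")) {
  b <- as.POSIXlt(bdate)
  r <- as.POSIXlt(ref)
  years <- r$year - b$year
  # Subtract 1 if this year's birthday has not happened yet:
  years - ((r$mon < b$mon) | (r$mon == b$mon & r$mday < b$mday))
}

completed_years(as.Date("2000-07-12"))  # => 18
completed_years(as.Date("2000-07-13"))  # => 18
completed_years(as.Date("2000-07-14"))  # => 17
```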

## (e) Frequency of blood_type

## by gender:
btg <- data_d %>%
  group_by(gender) %>%
  count(blood_type)
btg
#> # A tibble: 16 x 3
#> # Groups:   gender [2]
#>    gender blood_type     n
#>     <chr>      <chr> <int>
#>  1 female         A+   194
#>  2 female         A−    41
#>  3 female        AB+     9
#>  4 female        AB−     5
#>  5 female         B+    37
#>  6 female         B−     7
#>  7 female         O+   204
#>  8 female         O−    34
#>  9   male         A+   158
#> 10   male         A−    27
#> 11   male        AB+    15
#> 12   male        AB−     4
#> 13   male         B+    44
#> 14   male         B−     3
#> 15   male         O+   184
#> 16   male         O−    34

## Re-format from long to wider format: 
btg %>%
  spread(key = blood_type, value = n)
#> # A tibble: 2 x 9
#> # Groups:   gender [2]
#>   gender  `A+`  `A−` `AB+` `AB−`  `B+`  `B−`  `O+`  `O−`
#> *  <chr> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 female   194    41     9     5    37     7   204    34
#> 2   male   158    27    15     4    44     3   184    34

## (f+g) Descriptives and distribution of height by gender:
data_d %>%
  group_by(gender) %>%
  summarise(n = n(),
            n_notNA = sum(!is.na(height)),
            mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE)
            )
#> # A tibble: 2 x 5
#>   gender     n n_notNA mn_height sd_height
#>    <chr> <int>   <int>     <dbl>     <dbl>
#> 1 female   531     531  160.7232  8.787674
#> 2   male   469     469  173.3475 11.034750

ggplot(data_d, aes(x = gender, y = height, color = gender)) +
  geom_violin() +
  geom_jitter(size = 2, alpha = 1/3) + 
  scale_color_brewer(palette = "Set1") + 
  labs(title = "Distribution of height by gender",
       x = "Gender", y = "Height (in cm)", 
       caption = "[ds4psy]") + 
  theme_bw()


## (f+g) Descriptives and distribution of height by cohort:
data_f <- data_d %>%
  mutate(cohort = age %/% 10)

data_f %>%
  group_by(cohort) %>%
  summarise(n = n(),
            n_notNA = sum(!is.na(height)),
            mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE)
            )
#> # A tibble: 9 x 5
#>   cohort     n n_notNA mn_height sd_height
#>    <dbl> <int>   <int>     <dbl>     <dbl>
#> 1      1    29      29  169.1724 10.285458
#> 2      2   157     157  170.9108 12.126669
#> 3      3   181     181  168.0773 11.756066
#> 4      4   162     162  166.8642 10.743483
#> 5      5   152     152  165.5987 10.861110
#> 6      6   127     127  166.7874 11.389657
#> 7      7   122     122  162.9918 11.479407
#> 8      8    60      60  160.2167 12.555088
#> 9      9    10      10  160.0000  8.819171

# Inspect cohort:
typeof(data_f$cohort)
#> [1] "double"
data_f$cohort <- as.factor(data_f$cohort)

ggplot(data_f, aes(x = cohort, y = height, color = cohort)) +
  geom_violin() +
  geom_jitter(size = 2, alpha = 1/2) + 
  scale_color_brewer(name = "Cohort:", palette = "Set1") + 
  labs(title = "Distribution of height by cohort",
       x = "Cohort (x 10 years)", y = "Height (in cm)", 
       caption = "[ds4psy: numeracy data]") + 
  theme_bw()


# Visualizing distributions by both gender and cohort: 
ggplot(data_f, aes(x = cohort, y = height, color = cohort)) +
  facet_wrap(~gender) + 
  geom_boxplot() + 
  scale_color_brewer(name = "Cohort:", palette = "Set1") + 
  labs(title = "Boxplots of height by cohort (by gender)",
       x = "Cohort (x 10 years)", y = "Height (in cm)", 
       caption = "[ds4psy: numeracy data]") + 
  coord_flip() + 
  theme_bw()


## (C) Assessing DVs: ----- 
data_f
#> # A tibble: 1,000 x 20
#>     name gender      bdate byear bmonth  bday   age summer_born bweekday
#>    <chr>  <chr>     <date> <int>  <int> <int> <dbl>       <lgl>    <chr>
#>  1  I.G.   male 1968-12-14  1968     12    14    49       FALSE      Sat
#>  2  O.B.   male 1974-04-10  1974      4    10    44        TRUE      Wed
#>  3  M.M.   male 1987-09-28  1987      9    28    30        TRUE      Mon
#>  4  V.J. female 1978-02-15  1978      2    15    40       FALSE      Wed
#>  5  O.E.   male 1985-05-18  1985      5    18    33        TRUE      Sat
#>  6  Q.W.   male 1968-03-01  1968      3     1    50       FALSE      Fri
#>  7  H.K.   male 1994-04-27  1994      4    27    24        TRUE      Wed
#>  8  T.R. female 1961-06-05  1961      6     5    57        TRUE      Mon
#>  9  F.J.   male 1983-10-01  1983     10     1    35       FALSE      Sat
#> 10  J.R. female 1941-12-29  1941     12    29    76       FALSE      Mon
#> # ... with 990 more rows, and 11 more variables: height <int>,
#> #   blood_type <chr>, bnt_1 <int>, bnt_2 <int>, bnt_3 <int>, bnt_4 <int>,
#> #   g_iq <int>, s_iq <int>, had_bday_this_year <lgl>,
#> #   no_bday_this_year_yet <lgl>, cohort <fctr>

# (h) BNT values:
data_h <- data_f %>%
  mutate(BNT = bnt_1 + bnt_2 + bnt_3 + bnt_4) %>%
  select(name:bnt_4, BNT, everything())

data_h %>%
  select(bnt_1:BNT)
#> # A tibble: 1,000 x 5
#>    bnt_1 bnt_2 bnt_3 bnt_4   BNT
#>    <int> <int> <int> <int> <int>
#>  1     1     0     0     1     2
#>  2     1     1     1    NA    NA
#>  3     0     1     0     0     1
#>  4     0     0     0     0     0
#>  5     1     0     0     0     1
#>  6     1     1     1     0     3
#>  7     0     1    NA     0    NA
#>  8     1     0     1     0     2
#>  9     0     0     0     0     0
#> 10     1     1     0     1     3
#> # ... with 990 more rows

mean(is.na(data_h$BNT)) # => 78 missing BNT values (i.e., 7.8%)
#> [1] 0.078
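The NA counts follow from how + propagates missing values through a sum; a minimal illustration:

```r
# A single missing BNT item makes the participant's sum score missing:
bnt <- c(1, 1, 1, NA)   # one unanswered BNT item
sum(bnt)                # NA: missingness propagates
sum(bnt, na.rm = TRUE)  # 3: skips the missing item instead
```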

# (i) Inspecting g_iq and s_iq:
sum(is.na(data_h$g_iq)) # => 20 missing g_iq values
#> [1] 20
sum(is.na(data_h$s_iq)) # => 30 missing s_iq values
#> [1] 30

summary(data_h$g_iq, na.rm = TRUE) # mean of 101.9, range from 73 to 139. 
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>   73.00   95.75  101.00  101.91  108.00  139.00      20
summary(data_h$s_iq, na.rm = TRUE) # mean of 102.0, range from 70 to 131.  
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>    70.0    95.0   102.0   102.0   109.8   131.0      30

# Save means and medians for plots:
mn_g_iq <- summary(data_h$g_iq, na.rm = TRUE)[["Mean"]]
md_g_iq <- summary(data_h$g_iq, na.rm = TRUE)[["Median"]]
mn_s_iq <- summary(data_h$s_iq, na.rm = TRUE)[["Mean"]]
md_s_iq <- summary(data_h$s_iq, na.rm = TRUE)[["Median"]]

ggplot(data_h) +
  geom_histogram(aes(x = g_iq), binwidth = 2, fill = "gold", color = "black") + 
  geom_vline(xintercept = mn_g_iq, linetype = 2, color = "steelblue") +  # mark mean by vertical dashed line 
  # geom_vline(xintercept = md_g_iq, linetype = 3, color = "steelblue") +  # mark median by vertical dotted line   
  labs(title = "Distribution of general intelligence values",
       x = "General intelligence", y = "Frequency", 
       caption = "[ds4psy: numeracy data]") + 
  theme_bw()


ggplot(data_h) +
  geom_histogram(aes(x = s_iq), binwidth = 2, fill = "steelblue", color = "black") +
  geom_vline(xintercept = mn_s_iq, linetype = 2, color = "gold") +  # mark mean by vertical dashed line 
  # geom_vline(xintercept = md_s_iq, linetype = 3, color = "gold") +  # mark median by vertical dotted line   
  labs(title = "Distribution of social intelligence values",
       x = "Social intelligence", y = "Frequency", 
       caption = "[ds4psy: numeracy data]") + 
  theme_bw()


# (j) Relationship between g_iq and s_iq:
ggplot(data_h, aes(x = g_iq, y = s_iq)) +
  geom_point(position = "jitter", alpha = 1/3) +
  geom_smooth() +
  geom_abline(color = "forestgreen", linetype = 2) +  
  labs(title = "Social intelligence by general intelligence",
       x = "General intelligence", y = "Social intelligence", 
       caption = "[ds4psy: numeracy data]") + 
  # coord_fixed() + 
  theme_bw()


# => No clear relationship, but possibly a decline of s_iq with increasing g_iq.



## (D) Exploring results: ----- 


pt <- pt + 30

Task 15: Your data, your plot…

Now it’s your turn to find and plot some data! Find a table with interesting data online (e.g., at Wikipedia), load and save it (by copying or re-creating it) as a tibble in R, and then re-format, analyze, and visualize this tibble in one or several graphs. Your goal should be to illustrate some – but not necessarily all – key aspects of the data in a transparent fashion (i.e., making it easy to see some key observations, relationships between variables, or possible trends).

Please note: Your chosen data table should meet the following requirements:

  1. Make sure to select a data source that has a stable URL that is cited in your script (to allow verifying the source and integrity of your data).
  2. Your initial data table (tibble) should contain a minimum of 4 rows and 4 columns.

Examples of suitable sources include tables of

Hints: If you choose some data of interest to you, this task should be fun, rather than a chore…

(5 points + up to 5 bonus points)
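As a hedged illustration of the expected workflow: a small table copied by hand from an online source (here, an invented mini-table of tall buildings; the specific figures are for demonstration only, and any real submission must cite its source URL):

```r
# Hypothetical example: re-creating a small online table by hand
# (values are illustrative; cite your actual source URL in your script):
tallest <- data.frame(
  building = c("Burj Khalifa", "Shanghai Tower", "Abraj Al-Bait", "Ping An"),
  city     = c("Dubai", "Shanghai", "Mecca", "Shenzhen"),
  height_m = c(828, 632, 601, 599),
  floors   = c(163, 128, 120, 115)
)

# One key observation (which building is tallest?):
tallest$building[which.max(tallest$height_m)]  # => "Burj Khalifa"
```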

Appendix

Creating data

Generating datasets with specific properties not yet used in the above tasks.

Create numeracy data

Create a (psychological) dataset numeracy that allows illustrating contents from all chapters.

Tasks

Basics:

  • Report the data dimensions and number of missing values.

  • Split the birth date (bdate) variable into 3 separate variables (byear, bmonth, and bday) that denote the year, month, and day of each participant’s birth.

  • Create a new variable summerborn that is TRUE when a person was born in summer (defined as April to September) and FALSE when a person was born in winter (October to March).

Assessing IVs:

  • Compute the current age of each participant (in years, taking into account today’s date).

  • List the frequency of each blood type (blood_type) (a) overall and (b) by gender.

  • Compute descriptives (the means and standard deviations) of height (a) by gender and (b) by cohort (decade of birth).

  • Plot the distributions of heights by gender and cohort.

Assessing DVs:

  • Compute the aggregate BNT score (a BNT value from 0 to 4 points) for each participant and check whether it varies systematically as a function of gender.

  • Check the intelligence scores g_iq and s_iq:

    • How many missing values?
    • What are their means and ranges?
    • Plot their distributions in 2 histograms.
  • Plot the relationship between g_iq and s_iq.

  • Assess possible effects of numeracy (as measured by BNT) on the 2 types of intelligence (g_iq and s_iq).

  • Assess possible effects of the variables

    • age (age),
    • birth season (summerborn),
    • blood type (blood_type), and
    • the weekday on which a person was born (bweekday)

on the 2 types of intelligence (g_iq and s_iq) in 2 ways:

a. numerically (by computing group means) and 
b. graphically (by plotting means and/or distributions). 

Create exp data

Create an (experimental) dataset exp that contains typical features (e.g., multiple measurements with nested factors) of psychological studies.

n <- 10       # [n]umber of participants
set.seed(88)  # for replicability

IVs <- data.frame("name" = c("Ann", "Bea", "Cat", "Deb", "Eva", "Fred", "Gary", "Hans", "Ian", "John"),
                  "gender" = c(rep("f", 5), rep("m", 5)), 
                  "age" = sample(18:65, n, replace = TRUE)
                   )
IVs
#>    name gender age
#> 1   Ann      f  37
#> 2   Bea      f  22
#> 3   Cat      f  53
#> 4   Deb      f  41
#> 5   Eva      f  65
#> 6  Fred      m  65
#> 7  Gary      m  19
#> 8  Hans      m  54
#> 9   Ian      m  50
#> 10 John      m  64

## (a) within-subjects conditions (with multiple tasks per person):
DVs <- data.frame("task_1" = rep(c("red", "blue"), 5),
                   # "pos_1" = rep(1, n),
                   "time_1" = sample(10:99, n), 
                   "task_2" = rep(c("blue", "red"), 5),
                   # "pos_2" = rep(2, n),
                   "time_2" = sample(10:99, n)
                   )
DVs
#>    task_1 time_1 task_2 time_2
#> 1     red     59   blue     29
#> 2    blue     76    red     75
#> 3     red     58   blue     83
#> 4    blue     25    red     89
#> 5     red     11   blue     88
#> 6    blue     38    red     69
#> 7     red     18   blue     44
#> 8    blue     99    red     98
#> 9     red     71   blue     56
#> 10   blue     91    red     41


## (b) between-subjects conditions (with separate variables):

# DVs2 <- data.frame("cond" = c(rep("A", n/2), rep("B", n/2)),
#                    "A.num1" = c(sample(1:7, n/2), rep(NA, n/2)), 
#                    "B.num1" = c(rep(NA, n/2), sample(3:9, n/2)),
#                    "A.chr1" = c(sample(c("ABBA", "Beatles"), n/2, replace = TRUE), rep(NA, n/2)), 
#                    "B.chr1" = c(rep(NA, n/2), sample(c("ABBA", "Pink Floyd"), n/2, replace = TRUE))
#                    )

DVs2 <- data.frame("cond" = rep(c("A", "B"), n/2))

DVs2$A.num1 <- NA
DVs2$A.num1[DVs2$cond == "A"] <- c(sample(1:6, n/2, replace = TRUE))
DVs2$B.num1 <- NA
DVs2$B.num1[DVs2$cond == "B"] <- c(sample(4:9, n/2, replace = TRUE))

DVs2$A.chr1 <- NA
DVs2$A.chr1[DVs2$cond == "A"] <- c(sample(c("Abba", "Beatles"), n/2, replace = TRUE))
DVs2$B.chr1 <- NA
DVs2$B.chr1[DVs2$cond == "B"] <- c(sample(c("Beatles", "Zappa"), n/2, replace = TRUE))
DVs2
#>    cond A.num1 B.num1  A.chr1  B.chr1
#> 1     A      2     NA Beatles    <NA>
#> 2     B     NA      9    <NA> Beatles
#> 3     A      5     NA    Abba    <NA>
#> 4     B     NA      6    <NA> Beatles
#> 5     A      4     NA    Abba    <NA>
#> 6     B     NA      8    <NA> Beatles
#> 7     A      3     NA Beatles    <NA>
#> 8     B     NA      8    <NA>   Zappa
#> 9     A      5     NA    Abba    <NA>
#> 10    B     NA      9    <NA> Beatles

# Note that DVs encodes order (or chronological trial position) 
# implicitly (as 1st vs. 2nd entry) for every case.

## Combine IVs and DVs: 
exp <- cbind(IVs, DVs)
exp2 <- cbind(IVs, DVs2)

# exp
# exp2

dim(exp)
#> [1] 10  7
dim(exp2)
#> [1] 10  8

Use exp data

Basic manipulations of exp data (in base R):

exp.t <- exp # temporary copy

# 1. Adding 2 new variables/columns to a df by assigning/using it: 
exp.t$id <- 1:nrow(exp.t)
exp.t$bnt <- sample(1:4, n, replace = TRUE)
# exp.t

# 2. Swap some columns (e.g., putting id to front): 
id.col  <- which(names(exp.t) == "id")  # determine id column
bnt.col <- which(names(exp.t) == "bnt") # determine bnt column
exp.t <- exp.t[ , c(id.col, 1:(id.col - 1), bnt.col)]
exp.t
#>    id name gender age task_1 time_1 task_2 time_2 bnt
#> 1   1  Ann      f  37    red     59   blue     29   1
#> 2   2  Bea      f  22   blue     76    red     75   1
#> 3   3  Cat      f  53    red     58   blue     83   2
#> 4   4  Deb      f  41   blue     25    red     89   4
#> 5   5  Eva      f  65    red     11   blue     88   1
#> 6   6 Fred      m  65   blue     38    red     69   1
#> 7   7 Gary      m  19    red     18   blue     44   2
#> 8   8 Hans      m  54   blue     99    red     98   3
#>  [ reached getOption("max.print") -- omitted 2 rows ]

# 3. Sorting cases: 
exp.t[order(exp.t$task_1, exp.t$bnt),] # sort cases by task_1 and bnt values
#>    id name gender age task_1 time_1 task_2 time_2 bnt
#> 2   2  Bea      f  22   blue     76    red     75   1
#> 6   6 Fred      m  65   blue     38    red     69   1
#> 8   8 Hans      m  54   blue     99    red     98   3
#> 10 10 John      m  64   blue     91    red     41   3
#> 4   4  Deb      f  41   blue     25    red     89   4
#> 1   1  Ann      f  37    red     59   blue     29   1
#> 5   5  Eva      f  65    red     11   blue     88   1
#> 9   9  Ian      m  50    red     71   blue     56   1
#>  [ reached getOption("max.print") -- omitted 2 rows ]

# 4. Reversing cases and variables: 
#    Reversing the (arbitrary) order of all cases and all variables in df:
exp.t[rev(1:nrow(exp.t)), rev(1:ncol(exp.t))]
#>    bnt time_2 task_2 time_1 task_1 age gender name id
#> 10   3     41    red     91   blue  64      m John 10
#> 9    1     56   blue     71    red  50      m  Ian  9
#> 8    3     98    red     99   blue  54      m Hans  8
#> 7    2     44   blue     18    red  19      m Gary  7
#> 6    1     69    red     38   blue  65      m Fred  6
#> 5    1     88   blue     11    red  65      f  Eva  5
#> 4    4     89    red     25   blue  41      f  Deb  4
#> 3    2     83   blue     58    red  53      f  Cat  3
#>  [ reached getOption("max.print") -- omitted 2 rows ]

Same steps in the tidyverse:

library(tidyverse)

exp.t2 <- exp # temporary copy

# 1. Adding 2 new variables/columns to a df by assigning/using it: 
exp.t2 <- exp.t2 %>%
  mutate(id = 1:nrow(exp.t2),
         bnt = sample(1:4, n, replace = TRUE))
# exp.t2

# 2. Swap some columns (e.g., putting id to front): 
exp.t2 <- exp.t2 %>%
  select(id, everything())
# exp.t2

# 3. Sorting cases:
exp.t2 <- exp.t2 %>%
  arrange(task_1, bnt)
# exp.t2

# 4. Reversing cases and variables: 
#    Reversing the (arbitrary) order of all cases and all variables in df:
exp.t2 %>%
  mutate(row = 1:nrow(exp.t2)) %>%  # add helper column 
  arrange(desc(row)) %>%
  select(-row) %>%                  # remove helper column again 
  select(rev(names(exp.t2)))
#>    bnt time_2 task_2 time_1 task_1 age gender name id
#> 1    3     56   blue     71    red  50      m  Ian  9
#> 2    3     44   blue     18    red  19      m Gary  7
#> 3    3     83   blue     58    red  53      f  Cat  3
#> 4    2     88   blue     11    red  65      f  Eva  5
#> 5    1     29   blue     59    red  37      f  Ann  1
#> 6    3     41    red     91   blue  64      m John 10
#> 7    3     75    red     76   blue  22      f  Bea  2
#> 8    2     69    red     38   blue  65      m Fred  6
#>  [ reached getOption("max.print") -- omitted 2 rows ]

Counting tasks and points

This file contains 15 tasks for a total of 104 points (plus some possible bonus points).

[Last update on 2018-07-13 15:37:56 by hn.]